1. INTRODUCTION

There are two primary methods in conventional computing for the execution 
of algorithms. The first is to use hardwired technology, either an 
Application Specific Integrated Circuit (ASIC) or a group of individual 
components forming a board-level solution, to perform the operations in 
hardware. ASICs are designed specifically to perform a given computation, 
and thus they are very fast and efficient when executing the exact 
computation for which they were designed. However, the circuit cannot be 
altered after fabrication. This forces a redesign and refabrication of the
chip if any part of its circuit requires modification. This is an expensive 
process, especially when one considers the difficulties in replacing ASICs 
in a large number of deployed systems. Board-level circuits are also 
somewhat inflexible, frequently requiring a board redesign and replacement 
in the event of changes to the application. 

The second method is to use software-programmed microprocessors — a 
far more flexible solution. Processors execute a set of instructions to 
perform a computation. By changing the software instructions, the 
functionality of the system is altered without changing the hardware. 
However, the downside of this flexibility is that the performance can 
suffer, if not in clock speed then in work rate, and is far below that of 
an ASIC. The processor must read each instruction from memory, decode its 
meaning, and only then execute it. This results in a high execution 
overhead for each individual operation. Additionally, the set of 
instructions that may be used by a program is determined at the fabrication 
time of the processor. Any other operations that are to be implemented must 
be built out of existing instructions. 

Reconfigurable computing is intended to fill the gap between hardware 
and software, achieving potentially much higher performance than software, 
while maintaining a higher level of flexibility than hardware. 
Reconfigurable devices, including field-programmable gate arrays (FPGAs), 
contain an array of computational elements whose functionality is 
determined through multiple programmable configuration bits. These 
elements, sometimes known as logic blocks, are connected using a set of 
routing resources that are also programmable. In this way, custom digital
circuits can be mapped to the reconfigurable hardware by computing the
logic functions of the circuit within the logic blocks, and using the
configurable routing to connect the blocks together to form the necessary
circuit.

FPGAs and reconfigurable computing have been shown to accelerate a
variety of applications. Data encryption, for example, is able to leverage
both parallelism and fine-grained data manipulation. An implementation of
the Serpent Block Cipher in the Xilinx Virtex XCV1000 shows a throughput
increase by a factor of over 18 compared to a Pentium Pro PC running at 200
MHz (Elbirt and Paar 2000). Additionally, a reconfigurable computing
implementation of sieving for factoring large numbers (useful in breaking
encryption schemes) was accelerated by a factor of 28 over a 200-MHz
UltraSparc workstation (Kim and Mangione-Smith 2000). The Garp architecture
shows a comparable speed-up for DES (Hauser and Wawrzynek 1997), as does an
FPGA implementation of an elliptic curve cryptography application (Leung et
al. 2000).

Other recent applications that have been shown to exhibit significant
speedups using reconfigurable hardware include: automatic target
recognition (Rencher and Hutchings 1997), string pattern matching
(Weinhardt and Luk 1999), Golomb Ruler Derivation (Dollas et al. 1998;
Sotiriades et al. 2000), transitive closure of dynamic graphs (Huelsbergen
2000), Boolean satisfiability (Zhong et al. 1998), data compression (Huang
et al. 2000), and genetic algorithms for the travelling salesman problem
(Graham and Nelson 1996).

In order to achieve these performance benefits, yet support a wide
range of applications, reconfigurable systems are usually formed with a
combination of reconfigurable logic and a general-purpose microprocessor.
The processor performs the operations that cannot be done efficiently in
the reconfigurable logic, such as data-dependent control and possibly
memory accesses, while the computational cores are mapped to the
reconfigurable hardware. This reconfigurable logic can be composed of
either commercial FPGAs or custom configurable hardware.

Compilation environments for reconfigurable hardware range from tools
that assist a programmer in hand-mapping a circuit to the hardware, to
fully automated systems that take a circuit description in a high-level
language to a configuration for a reconfigurable system. The design
process first partitions a program into sections to be implemented in
hardware and sections to be implemented in software on the host
processor. The computations destined for the reconfigurable hardware are
synthesized into a gate-level or register-transfer-level circuit
description. This circuit is mapped onto the logic blocks within the
reconfigurable hardware during the technology mapping phase. These mapped
blocks are then placed into the specific physical blocks within the
hardware, and the pieces of the circuit are connected using the
reconfigurable routing. After compilation, the circuit is ready for
configuration onto the hardware at run-time. These steps, when performed
using an automatic compilation system, require very little effort on the
part of the programmer to utilize the reconfigurable hardware. However,
performing some or all of these operations by hand can result in a more
highly optimized circuit for performance-critical applications.

Since FPGAs must pay an area penalty because of their
reconfigurability, device capacity can sometimes be a concern. Systems that
are configured only at power-up are able to accelerate only as much of the
program as will fit within the programmable structures. Additional areas of
a program might be accelerated by reusing the reconfigurable hardware
during program execution. This process is known as run-time reconfiguration
(RTR). While this style of computing has the benefit of allowing for the
acceleration of a greater portion of an application, it also introduces the
overhead of configuration, which limits the amount of acceleration
possible. Because configuration can take milliseconds or longer, rapid and
efficient configuration is a critical issue. Methods such as configuration
compression and the partial reuse of already programmed configurations can
be used to reduce this overhead.
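
The way configuration overhead erodes acceleration can be illustrated
with a small calculation. The sketch below is not taken from any of the
surveyed systems: it assumes a simple additive model in which every
reconfiguration stalls the computation for a fixed time, and the function
name rtr_speedup and all of the numbers are hypothetical.

```python
def rtr_speedup(t_software, t_hardware, t_config, n_reconfigs):
    """Net speedup of a run-time reconfigured kernel over pure software.

    Assumed model (illustrative only): the hardware run time and the time
    spent on each reconfiguration simply add up, with no overlap between
    configuration and execution. All times use the same unit.
    """
    total_hardware_time = t_hardware + n_reconfigs * t_config
    return t_software / total_hardware_time


# Hypothetical kernel: 100 ms in software, 10 ms on the reconfigurable logic.
print(rtr_speedup(100.0, 10.0, 0.0, 0))   # no reconfiguration: 10.0
print(rtr_speedup(100.0, 10.0, 5.0, 4))   # four 5 ms reconfigurations: about 3.33
print(rtr_speedup(100.0, 10.0, 45.0, 2))  # two 45 ms reconfigurations: 1.0
```

Under these assumptions, the last case shows the break-even point at which
the time spent configuring cancels the acceleration entirely.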

This article presents a survey of current research in hardware and
software systems for reconfigurable computing, as well as techniques that
specifically target run-time reconfigurability. We lead off this discussion
by examining the technology required for reconfigurable computing, followed
by a more in-depth examination of the various hardware structures used in
reconfigurable systems. Next, we look at the software required for
compilation of algorithms to configurable computers, and the trade-offs
between hand-mapping and automatic compilation. Finally, we discuss
run-time reconfigurable systems, which further utilize the intrinsic
flexibility of configurable computing platforms by optimizing the hardware
not only for different applications, but for different operations within a
single application as well.

This survey does not seek to cover every technique and research
project in the area of reconfigurable computing. Instead, it hopes to serve
as an introduction to this rapidly evolving field, bringing interested
readers quickly up to speed on developments from the last half-decade.
Those interested in further background can find coverage of older
techniques and systems elsewhere (Rose et al. 1993; Hauck and Agarwal 1996;
Vuillemin et al. 1996; Mangione-Smith et al. 1997; Hauck 1998b).

2. TECHNOLOGY

Reconfigurable computing as a concept has been in existence for quite
some time (Estrin et al. 1963). Even general-purpose processors use some of
the same basic ideas, such as reusing computational components for
independent computations, and using multiplexers to control the routing
between these components. However, the term reconfigurable computing, as it
is used in current research (and within this survey), refers to systems
incorporating some form of hardware programmability: customizing how the
hardware is used through a number of physical control points. These control
points can then be changed periodically in order to execute different
applications using the same hardware.

The recent advances in reconfigurable computing are for the most part
derived from the technologies developed for FPGAs in the mid-1980s. FPGAs
were originally created to serve as a hybrid device between PALs and
Mask-Programmable Gate Arrays (MPGAs). Like PALs, FPGAs are fully
electrically programmable, meaning that the physical design costs are
amortized over multiple application circuit implementations, and the
hardware can be customized nearly instantaneously. Like MPGAs, they can
implement very complex computations on a single chip, with devices
currently in production containing the equivalent of over a million gates.
Because of these features, FPGAs had been primarily viewed as glue-logic
replacement and rapid-prototyping vehicles. However, as we show
throughout this article, the flexibility, capacity, and performance of
these devices have opened up completely new avenues in high-performance
computation, forming the basis of reconfigurable computing.

Most current FPGAs and reconfigurable devices are SRAM-programmable
(Figure 1 left), meaning that SRAM (1) bits are connected to the
configuration points in the FPGA, and programming the SRAM bits configures
the FPGA. Thus, these chips can be programmed and reprogrammed about as
easily as a standard static RAM. In fact, one research project, the PAM
project (Vuillemin et al. 1996), considers a group of one or more FPGAs to
be a RAM unit that performs computation between the memory write (sending
the configuration information and input data) and memory read (reading the
results of the computation). This leads some to use the term Programmable
Active Memory or PAM.

(FIGURE 1 OMITTED) 

One example of how the SRAM configuration points can be used is to
control routing within a reconfigurable device (Chow et al. 1999a). To
configure the routing on an FPGA, typically a passgate structure is
employed (see Figure 1 right). Here the programming bit will turn on a
routing connection when it is configured with a true value, allowing a
signal to flow from one wire to another, and will disconnect these
resources when the bit is set to false. With a proper interconnection of
these elements, which may include millions of routing choice points within
a single device, a rich routing fabric can be created.

Another example of how these configuration bits may be used is to
control multiplexers, which will choose between the output of different
logic resources within the array. For example, to provide optional
stateholding elements a D flip-flop (DFF) may be included with a
multiplexer selecting whether to forward the latched or unlatched signal
value (see Figure 2 left). Thus, for systems that require state-holding the
programming bits controlling the multiplexer would be configured to select
the DFF output, while systems that do not need this function would choose
the bypass route that sends the input directly to the output. Similar
multiplexer structures can select among other on-chip resources, such as
fixed-logic computation elements, memories, carry chains, or other
functions.

(FIGURE 2 OMITTED) 
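
The optional flip-flop with a bypass multiplexer can be modeled in a few
lines. This is only a behavioral sketch under simplifying assumptions (a
single configuration bit, a single clock, and no set/reset); the class
and method names are hypothetical and do not come from any vendor tool.

```python
class OptionalRegister:
    """Behavioral model of a D flip-flop followed by a bypass multiplexer.

    use_register plays the role of the SRAM configuration bit that drives
    the multiplexer select line: True forwards the latched value, False
    forwards the input directly (the bypass route).
    """

    def __init__(self, use_register):
        self.use_register = use_register
        self.q = 0  # value currently held by the flip-flop

    def clock(self, d):
        """Rising clock edge: the flip-flop captures the input d."""
        self.q = d

    def output(self, d):
        """Multiplexer output for the present input d."""
        return self.q if self.use_register else d


registered = OptionalRegister(use_register=True)
bypassed = OptionalRegister(use_register=False)

print(registered.output(1))  # 0: the new input has not been clocked in yet
registered.clock(1)
print(registered.output(0))  # 1: the latched value is forwarded
print(bypassed.output(1))    # 1: the input passes straight through
```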

Finally, the configuration bits may be used as control signals for a
computational unit or as the basis for computation itself. As a control
signal, a configuration bit may determine whether an ALU performs an
addition, subtraction, or other logic computations. On the other hand, with
a structure such as a lookup table (LUT), the configuration bits themselves
form the result of the computation (see Figure 2 right). These elements are
essentially small memories provided for computing arbitrary logic
functions. LUTs can compute any function of N inputs (where N is the number
of control signals for the LUT's multiplexer) by programming the 2^N
programming bits with the truth table of the desired function. Thus, if all
programming bits except the one corresponding to the input pattern 111 were
set to zero, a 3-input LUT would act as a 3-input AND gate, while
programming it with all ones except in 000 would compute a 3-input OR.
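
The LUT behavior described above can be sketched as a table lookup indexed
by the input bits. The code is a minimal software model, not a description
of any particular device; the helper name make_lut and the convention that
the first input is the most significant select line are assumptions made
for illustration.

```python
def make_lut(config_bits):
    """Build an N-input LUT from its 2**N configuration bits.

    config_bits[i] is the output for the input pattern whose binary value
    is i, with the first input treated as the most significant bit
    (an assumed ordering).
    """
    n = len(config_bits).bit_length() - 1
    if n < 0 or len(config_bits) != 1 << n:
        raise ValueError("a LUT needs exactly 2**N configuration bits")

    def lut(*inputs):
        if len(inputs) != n:
            raise ValueError(f"expected {n} inputs")
        index = 0
        for bit in inputs:  # the inputs act as multiplexer select lines
            index = (index << 1) | (bit & 1)
        return config_bits[index]

    return lut


# Only the bit for input pattern 111 is set: a 3-input AND gate.
and3 = make_lut([0, 0, 0, 0, 0, 0, 0, 1])
# Every bit set except the one for pattern 000: a 3-input OR gate.
all_ones_except_000 = make_lut([0, 1, 1, 1, 1, 1, 1, 1])

print(and3(1, 1, 1), and3(1, 0, 1))                                # 1 0
print(all_ones_except_000(0, 0, 0), all_ones_except_000(0, 1, 0))  # 0 1
```

Changing the function requires no change to the structure, only a
different list of configuration bits.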

3. HARDWARE

Reconfigurable computing systems use FPGAs or other programmable
hardware to accelerate algorithm execution by mapping compute-intensive
calculations to the reconfigurable substrate. These hardware resources are
frequently coupled with a general-purpose microprocessor that is
responsible for controlling the reconfigurable logic and executing program
code that cannot be efficiently accelerated. In very closely coupled
systems, the reconfigurability lies within customizable functional units on
the regular datapath of the microprocessor. On the other hand, a
reconfigurable computing system can be as loosely coupled as a networked
standalone unit. Most reconfigurable systems fall somewhere between these
two extremes, frequently with the reconfigurable hardware acting as a
coprocessor to a host microprocessor. The programmable array itself can
be composed of one or more commercially available FPGAs, or can be a
custom device designed specifically for reconfigurable computing.

The design of the actual computation blocks within the reconfigurable
hardware varies from system to system. Each unit of computation, or logic
block, can be as simple as a 3-input lookup table (LUT), or as complex as a
4-bit ALU. This difference in block size is commonly referred to as the
granularity of the logic block, where the 3-input LUT is an example of a
very fine-grained computational element, and a 4-bit ALU is an example of a
quite coarse-grained unit. The finer-grained blocks are useful for
bit-level manipulations, while the coarse-grained blocks are better
optimized for standard datapath applications. Some architectures employ
different sizes or types of blocks within a single reconfigurable array in
order to efficiently support different types of computation. For example,
memory is frequently embedded within the reconfigurable hardware to provide
temporary data storage, forming a heterogeneous structure composed of both
logic blocks and memory blocks (Ebeling et al. 1996; Altera 1998; Lucent
1998; Marshall et al. 1999; Xilinx 1999).

The routing between the logic blocks within the reconfigurable
hardware is also of great importance. Routing contributes significantly to
the overall area of the reconfigurable hardware. Yet, when the percentage
of logic blocks used in an FPGA becomes very high, automatic routing tools
frequently have difficulty achieving the necessary connections between the
blocks. Good routing structures are therefore essential to ensure that a
design can be successfully placed and routed onto the reconfigurable
hardware.

Once a circuit has been programmed onto the reconfigurable hardware,
it is ready to be used by the host processor during program execution. The
run-time operation of a reconfigurable system occurs in two distinct
phases: configuration and execution. The programming of the reconfigurable
hardware is under the control of the host processor. This host processor
directs a stream of configuration data to the reconfigurable hardware, and
this configuration data is used to define the actual operation of the
hardware. Configurations can be loaded solely at start-up of a program, or
periodically during runtime, depending on the design of the system. More
concepts involved in run-time reconfiguration (the dynamic reconfiguration
of devices during computation execution) are discussed in a later
section.

The actual execution model of the reconfigurable hardware varies from
system to system. For example, the NAPA system (Rupp et al. 1998) by
default suspends the execution of the host processor during execution on
the reconfigurable hardware. However, simultaneous computation can occur
with the use of fork-and-join primitives, similar to multiprocessor
programming. REMARC (Miyamori and Olukotun 1998) is a reconfigurable system
that uses a pipelined set of execution phases within the reconfigurable
hardware. These pipeline stages overlap with the pipeline stages of the
host processor, allowing for simultaneous execution. In the Chimaera system
(Hauck et al. 1997), the reconfigurable hardware is constantly executing
based upon the input values held in a subset of the host processor's
registers. A call to the Chimaera unit is in actuality only a fetch of the
result value. This value is stable and valid after the correct input values
have been written to the registers and have filtered through the
computation.

In the next sections, we consider in greater depth the
hardware issues in reconfigurable computing, including both logic and
routing. To support the computation demands of reconfigurable computing, we
consider the logic block architectures of these devices, including possibly
the integration of heterogeneous logic resources within a device.
Heterogeneity also extends between chips, where one of the most important
concerns is the coupling of the reconfigurable logic with standard,
general-purpose processors. However, reconfigurable devices are more than
just logic devices; the routing resources are at least as important as
logic resources, and thus we consider interconnect structures, including
1D-oriented devices that are beginning to appear.

3.1. Coupling 

Frequently, reconfigurable hardware is coupled with a traditional
microprocessor. Programmable logic tends to be inefficient at implementing
certain types of operations, such as variable-length loops and branch
control. In order to run an application in a reconfigurable computing
system most efficiently, the areas of the program that cannot be easily
mapped to the reconfigurable logic are executed on a host microprocessor.
Meanwhile, the areas with a high density of computation that can benefit
from implementation in hardware are mapped to the reconfigurable logic. For
the systems that use a microprocessor in conjunction with reconfigurable
logic, there are several ways in which these two computation structures may
be coupled, as Figure 3 shows.

(FIGURE 3 OMITTED) 

First, reconfigurable hardware can be used solely to provide
reconfigurable functional units within a host processor (Razdan and Smith
1994; Hauck et al. 1997). This allows for a traditional programming
environment with the addition of custom instructions that may change over
time. Here, the reconfigurable units execute as functional units on the
main microprocessor datapath, with registers used to hold the input and
output operands.

Second, a reconfigurable unit may be used as a coprocessor (Wittig
and Chow 1996; Hauser and Wawrzynek 1997; Miyamori and Olukotun 1998; Rupp
et al. 1998; Chameleon 2000). A coprocessor is, in general, larger than a
functional unit, and is able to perform computations without the constant
supervision of the host processor. Instead, the processor initializes the
reconfigurable hardware and either sends the necessary data to the logic,
or provides information on where this data might be found in memory. The
reconfigurable unit performs the actual computations independently of the
main processor, and returns the results after completion. This type of
coupling allows the reconfigurable logic to operate for a large number of
cycles without intervention from the host processor, and generally permits
the host processor and the reconfigurable logic to execute simultaneously.
This reduces the overhead incurred by the use of the reconfigurable logic,
compared to a reconfigurable functional unit that must communicate with the
host processor each time a reconfigurable "instruction" is used. One idea
that is somewhat of a hybrid between the first and second coupling methods
is the use of programmable hardware within a configurable cache (Kim et al.
2000). In this situation, the reconfigurable logic is embedded into the
data cache. This cache can then be used as either a regular cache or as an
additional computing resource depending on the target application.

Third, an attached reconfigurable processing unit (Vuillemin et al.
1996; Annapolis 1998; Laufer et al. 1999) behaves as if it is an additional
processor in a multiprocessor system or an additional compute engine
accessed semifrequently through external I/O. The host processor's data
cache is not visible to the attached reconfigurable processing unit. There
is, therefore, a higher delay in communication between the host processor
and the reconfigurable hardware, such as when communicating configuration
information, input data, and results. This communication is performed
through specialized primitives similar to multiprocessor systems. However,
this type of reconfigurable hardware does allow for a great deal of
computation independence, by shifting large chunks of a computation over to
the reconfigurable hardware.

Finally, the most loosely coupled form of reconfigurable hardware is
that of an external stand-alone processing unit (Quickturn 1999a, 1999b).
This type of reconfigurable hardware communicates infrequently with a host
processor (if present). This model is similar to that of networked
workstations, where processing may occur for very long periods of time
without a great deal of communication. In the case of the Quickturn
systems, however, this hardware is geared more towards emulation than
reconfigurable computing.

Each of these styles has distinct benefits and drawbacks. The tighter
the integration of the reconfigurable hardware, the more frequently it can
be used within an application or set of applications due to a lower
communication overhead. However, tightly coupled hardware is unable to
operate for significant portions of time without intervention from a host
processor, and the amount of reconfigurable logic available is often quite
limited. The more loosely coupled styles allow for greater parallelism in
program execution, but suffer from higher communications overhead. In
applications that require a great deal of communication, this can reduce or
remove any acceleration benefits gained through this type of
reconfigurable hardware.

3.2. Traditional FPGAs 

Before discussing the detailed architecture design of reconfigurable
devices in general, we will first describe the logic and routing of FPGAs.
These concepts apply directly to reconfigurable systems using commercial
FPGAs, such as PAM (Vuillemin et al. 1996) and Splash 2 (Arnold et al.
1992; Buell et al. 1996), and many also extend to architectures designed
specifically for reconfigurable computing. Hardware concepts applying
specifically to architectures designed for reconfigurable computing, as
well as variations on the generic FPGA description provided here, are
discussed following this section. More detailed surveys of FPGA
architectures themselves can be found elsewhere (Brown et al. 1992a; Rose
et al. 1993).

Since the introduction of FPGAs in the mid-1980s, there have been
many different investigations into what computation element(s) should be
built into the array (Rose et al. 1993). One could consider FPGAs that were
created with PAL-like product term arrays, or multiplexer-based
functionality, or even basic fixed functions such as simple NAND and XOR
gates. In fact, many such architectures have been built. However, it seems
to be fairly well established that the best function block for a standard
FPGA, a device whose primary role is the implementation of random digital
logic, is the one found in the first devices deployed: the lookup table
(Figure 2 right). As described in the previous section, an N-input LUT is
basically a memory that, when programmed appropriately, can compute any
function of up to N inputs. This flexibility, with relatively simple
routing requirements (each input need only be routed to a single
multiplexer control input), turns out to be very powerful for logic
implementation. Although a LUT is less area-efficient than a fixed logic
block such as a standard NAND gate, most current FPGAs use less than 10%
of their chip area for logic and devote the majority of the silicon real
estate to routing resources, which limits the impact of this
inefficiency.

The typical FPGA has a logic block with one or more 4-input LUT(s),
optional D flip-flops (DFF), and some form of fast carry logic (Figure 4).
The LUTs allow any function to be implemented, providing generic logic. The
flip-flop can be used for pipelining, registers, stateholding functions for
finite state machines, or any other situation where clocking is required.
Note that the flip-flops will typically include programmable set/reset
lines and clock signals, which may come from global signals routed on
special resources, or could be routed via the standard interconnect
structures from some other input or logic block. The fast carry logic is a
special resource provided in the cell to speed up carry-based computations,
such as addition, parity, wide AND operations, and other functions. These
resources will bypass the general routing structure, connecting instead
directly between neighbors in the same column. Since there are very few
routing choices in the carry chain, and thus less delay on the computation,
the inclusion of these resources can significantly speed up carry-based
computations.
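
As an illustration of how a column of logic blocks and a carry chain
cooperate, the sketch below models a ripple-carry adder in which each bit
position computes its sum in a LUT-like function while the carry is handed
directly to the neighboring cell. It is a functional model only, with no
notion of delay, and the helper names are hypothetical.

```python
def to_bits(value, width):
    """Split an integer into a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Reassemble an integer from a least-significant-bit-first list."""
    return sum(bit << i for i, bit in enumerate(bits))


def carry_chain_add(a_bits, b_bits, carry_in=0):
    """Add two equal-width bit vectors one cell (bit position) at a time.

    In each cell the sum bit is a 3-input function (a, b, carry) of the
    kind a LUT could hold, and the carry-out goes straight to the next
    cell, standing in for the dedicated neighbor-to-neighbor carry wire.
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same width")
    sum_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)       # sum: 3-input XOR
        carry = (a & b) | (carry & (a ^ b))  # carry-out to the neighbor
    return sum_bits, carry


sum_bits, carry_out = carry_chain_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(sum_bits), carry_out)  # 1 1, i.e. 11 + 6 = 17 = 16 + 1
```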

(FIGURE 4 OMITTED) 

Just as there has been a great deal of experimentation in FPGA logic
block architectures, there has been just as much investigation into
interconnect structures. As logic blocks have basically standardized on
LUT-based structures, routing resources have become primarily island-style,
with logic surrounded by general routing channels.

Most FPGA architectures organize their routing structures as a 
relatively smooth sea of routing resources, allowing fast and efficient 
communication along the rows and columns of logic blocks. As shown in 
Figure 5, the logic blocks are embedded in a general routing structure, 
with input and output signals attaching to the routing fabric through 
connection blocks. The connection blocks provide programmable multiplexers, 
selecting which of the signals in the given routing channel will be 
connected to the logic block's terminals. These blocks also connect shorter 
local wires to longer-distance routing resources. Signals flow from the 
logic block into the connection block, and then along longer wires within 
the routing channels. At the switchboxes, there are connections between the 
horizontal and vertical routing resources to allow signals to change their 
routing direction. Once the signal has traversed through routing resources 
and intervening switchboxes, it arrives at the destination logic block 
through one of its local connection blocks. In this manner, relatively 
arbitrary interconnections can be achieved between the logic blocks in the 
system. 

(FIGURE 5 OMITTED) 

Within a given routing channel, there may be a number of different 
lengths of routing resources. Some local interconnections may only move 
between adjacent logic blocks (carry chains are a good example of this),
providing high-speed local interconnect. Medium length lines may run the
width of several logic blocks, providing for some longer distance 
interconnect. Finally, longlines that run the entire chip width or height 
may provide for more global signals. Also, many architectures contain 
special "global lines" that provide high-speed, and often low-skew, 
connections to all of the logic blocks in the array. These are primarily 
used for clocks, resets, and other truly global signals. 

While the routing architecture of an FPGA is typically quite complex
(the connection blocks and switchboxes surrounding a single logic block
typically have thousands of programming points), it is designed to
be able to support fairly arbitrary interconnection patterns. Most users
ignore the exact details of these architectures and allow the automatic
physical design tools to choose appropriate resources to use in order to
achieve a given interconnect pattern.

3.3. Logic Block Granularity 

Most reconfigurable hardware is based upon a set of computation
structures that are repeated to form an array. These structures, commonly
called logic blocks or cells, vary in complexity from a very small and
simple block that can calculate a function of only three inputs, to a
structure that is essentially a 4-bit ALU. Some of these block types are
configurable, in that the actual operation is determined by a set of loaded
configuration data. Other blocks are fixed structures, and the
configurability lies in the connections between them. The size and
complexity of the basic computing blocks is referred to as the block's
granularity.

An example of a very fine-grained logic block can be found in the
Xilinx 6200 series of FPGAs (Xilinx 1996). The functional unit from one of
these cells, as shown in Figure 6, can implement any two-input function and
some three-input functions. However, although this type of architecture is
useful for very fine-grained bit manipulation, it can be too fine-grained
to efficiently implement many types of circuits, such as multipliers.
Similarly, finite state machines are frequently too complex to easily map
to a reasonable number of very fine-grained logic blocks. However, finite
state machines are also too dependent upon single bit values to be
efficiently implemented in a very coarse-grained architecture. This type of
circuit is more suited to an architecture that provides more connections
and computational power per logic block, yet still provides sufficient
capability for bit-level manipulation.

(FIGURE 6 OMITTED)

The logic cell in the Altera FLEX 10K architecture (Altera 1998) is a 
fine-grained structure that is somewhat coarser than the 6200. This 
architecture mainly consists of a single 4-input LUT with a flip-flop. 
Additionally, there is specialized carry-chain circuitry that helps to 
accelerate addition, parity, and other operations that use a carry chain. 
These types of logic blocks are useful for fine-grained bit-level 
manipulation of data, as can frequently be found in encryption and image 
processing applications. Also, because the cells are fine-grained, 
computation structures of arbitrary bit widths can be created. This can be 
useful for implementing datapath circuits that are based on data widths not 
implemented on the host processor (5 bit multiply, 18 bit addition, etc). 
Reconf igurable hardware can not only take advantage of small bit widths, 
but also large data widths. When a program uses bit widths in excess of 
what is normally available in a host processor, the processor must perform 
the computations using a number of extra steps in order to handle the full 
data width. A fine-grained architecture would be able to implement the full 
bit width in a single step, without the fetching, decoding, and execution 
of additional instructions, as long as enough logic cells are available. 
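
The idea of building a datapath of arbitrary width from fine-grained cells can be sketched in software. The following is a minimal model, not a description of any vendor's cell: each LUT is a truth table indexed by its input bits, and an adder of any width is a chain of identical bit slices. The names make_lut, read_lut, and add are hypothetical, and the loop only simulates what the hardware would compute spatially, with the carry rippling from slice to slice.

```python
def make_lut(func, num_inputs):
    """Build the LUT contents (a truth table) for a boolean function."""
    table = []
    for index in range(2 ** num_inputs):
        bits = [(index >> i) & 1 for i in range(num_inputs)]
        table.append(func(*bits) & 1)
    return table


def read_lut(table, *bits):
    """Evaluate a LUT: the input bits form the address into the table."""
    index = sum(bit << i for i, bit in enumerate(bits))
    return table[index]


# One adder bit slice: sum and carry-out of (a, b, carry_in).
# Both are 3-input functions, so each fits in a 4-input LUT.
SUM_LUT = make_lut(lambda a, b, c: a ^ b ^ c, 3)
CARRY_LUT = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)


def add(x, y, width):
    """Ripple-carry addition of two width-bit values, one slice per bit."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= read_lut(SUM_LUT, a, b, carry) << i
        carry = read_lut(CARRY_LUT, a, b, carry)
    return result, carry


# An 18-bit addition uses exactly 18 slices, no more and no fewer.
total, carry_out = add(200000, 100000, 18)
```

Changing the width argument changes only the number of slices used, which is the property the text attributes to fine-grained arrays.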

A number of reconfigurable systems use a granularity of logic block
that we categorize as medium-grained (Xilinx 1994; Hauser and Wawrzynek
1997; Haynes and Cheung 1998; Lucent 1998; Marshall et al. 1999). For
example, Garp (Hauser and Wawrzynek 1997) is designed to perform a number
of different operations on up to four 2-bit inputs. Another medium-grained
structure was designed specifically to be embedded inside of a 
general-purpose FPGA to implement multipliers of a configurable bit width 
(Haynes and Cheung 1998). The logic block used in the multiplier FPGA can
implement a 4 x 4 multiplication, or be cascaded into larger
structures. The CHESS architecture (Marshall et al. 1999) also operates on
4-bit values, with each of its cells acting as a 4-bit ALU. Medium-grained 
logic blocks may be used to implement datapath circuits of varying bit 
widths, similar to the fine-grained structures. However, with the ability 
to perform more complex operations of a greater number of inputs, this type 
of structure can be used efficiently to implement a wider variety of
operations.

Very coarse-grained architectures are primarily intended for the 
implementation of word-width datapath circuits. Because the logic blocks 
used are optimized for large computations, they will perform these 
operations much more quickly (and consume less chip area) than a set of 
smaller cells connected to form the same type of structure. However, 
because their composition is static, they are unable to leverage 
optimizations in the size of operands. The RaPiD architecture
(Ebeling et al. 1996), shown in Figure 7, and the Chameleon
architecture (Chameleon 2000) are examples of this very coarse-grained
type of design. Each of these architectures is composed of word-sized
adders, multipliers, and registers. If only three 1-bit values are 
required, then the use of these architectures suffers an unnecessary area 
and speed overhead, as all of the bits in the full word size are computed. 
However, these coarse-grained architectures can be much more efficient than 
fine-grained architectures for implementing functions closer to their basic 
word size. 

(FIGURE 7 OMITTED) 

An alternate form of a coarse-grained system is one in which the 
logic blocks are actually very small processors, potentially each with its 
own instruction memory and/or data values. The REMARC architecture
(Miyamori and Olukotun 1998) is composed of an 8 x 8 array of 16-bit 
processors. Each of these processors uses its own instruction memory in 
conjunction with a global program counter. This style of architecture 
closely resembles a single-chip multiprocessor, although with much simpler 
component processors because the system is intended to be coupled with a 
host processor. The RAW project (Moritz et al. 1998) is a further example
of a reconfigurable architecture based on a multiprocessor design.

The granularity of the FPGA also has a potential effect on the 
reconfiguration time of the device. This is an important issue for run-time 
reconfiguration, which is discussed in further depth in a later 
section. A fine-grained array has many configuration points to 
perform very small computations, and thus requires more data bits during
configuration.

3.4. Heterogeneous Arrays 

In order to provide greater performance or flexibility in 
computation, some reconfigurable systems provide a heterogeneous structure,
where the capabilities of the logic cells are not the same throughout the
system. One use of heterogeneity in reconfigurable systems is to provide
multiplier function blocks embedded within the reconfigurable hardware
(Haynes and Cheung 1998; Chameleon 2000; Xilinx 2001). Because
multiplication is one of the more difficult computations to implement
efficiently in a traditional FPGA structure, the custom multiplication
hardware embedded within a reconfigurable array allows a system to perform
even that function well.

Another use of heterogeneous structures is to provide embedded memory 
blocks scattered throughout the reconfigurable hardware. This allows
storage of frequently used data and variables, and allows for quick access
to these values due to the proximity of the memory to the logic blocks that
access it. Memory structures embedded into the reconfigurable fabric come
in two forms. The first is simply the use of available LUTs as RAM
structures, as can be done in the Xilinx 4000 series (Xilinx 1994) and
Virtex (Xilinx 1999) FPGAs. Although making these very small blocks into a
larger RAM structure introduces overhead to the memory system, it does 
provide local, variable width memory structures. 

Some architectures include dedicated memory blocks within their 
array, such as the Xilinx Virtex series (Xilinx 1999, 2001) and Altera 
(Altera 1998) FPGAs, as well as the CS2000 RCP (reconfigurable
communications processor) device from Chameleon Systems, Inc. (Chameleon
2000). These memory blocks have greater performance in large sizes than
similar-sized structures built from many small LUTs. While these structures
are somewhat less flexible than the LUT-based memories, they can also 
provide some customization. For example, the Altera FLEX 10K FPGA (Altera 
1998) provides embedded memories that have a limited total number of wires, 
but allow a trade-off between the number of address lines and the data bit 
width. 

When embedded memories are not used for data storage by a particular 
configuration, the area that they occupy does not necessarily have to be 
wasted. By using the address lines of the memory as function inputs and the 
values stored in the memory as function outputs, logical expressions of a 
large number of inputs can be emulated (Altera 1998; Cong and Xu 1998;
Wilton 1998; Heile and Leaver 1999). In fact, because there may be more
than one value output from the memory on a read operation, the memory 
structure may be able to perform multiple different computations (one for 
each bit of data output), provided that all necessary inputs appear on the 
address lines. In this manner, the embedded RAM behaves the same as a very 
large LUT. Therefore, embedded memory allows a programmer or a synthesis
tool to perform a trade-off between logic and memory usage in order to 
achieve higher area efficiency. 
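
Using an embedded memory as logic can be illustrated with a small model. In this sketch, which assumes a hypothetical 256-word by 3-bit memory and three illustrative functions of 8 inputs, the address lines carry the function inputs and each data bit holds the output of a different function. The names program_memory, parity, majority, and all_ones are invented for the example.

```python
def program_memory(functions, address_bits):
    """Fill a memory so that data bit k of each word is functions[k](inputs)."""
    words = []
    for address in range(2 ** address_bits):
        inputs = [(address >> i) & 1 for i in range(address_bits)]
        word = 0
        for k, func in enumerate(functions):
            word |= (func(inputs) & 1) << k
        words.append(word)
    return words


def parity(inputs):
    return sum(inputs) % 2


def majority(inputs):
    return int(sum(inputs) > len(inputs) // 2)


def all_ones(inputs):
    return int(all(inputs))


# A 256 x 3 memory computes three different 8-input functions per read.
ram = program_memory([parity, majority, all_ones], address_bits=8)

word = ram[0b10110111]           # one read operation
parity_out = word & 1            # data bit 0
majority_out = (word >> 1) & 1   # data bit 1
all_ones_out = (word >> 2) & 1   # data bit 2
```

A single read returns all three results at once, provided every function draws its inputs from the same address lines.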

Furthermore, a few of the commercial FPGA companies have announced 
plans to include entire microprocessors as embedded structures within their 
FPGAs. Altera has demonstrated a preliminary ARM9-based Excalibur device, 
which combines reconfigurable hardware with an embedded ARM9 processor core
(Altera 2001). Meanwhile, Xilinx is working with IBM to include a PowerPC
processor core within the Virtex-II FPGA (Xilinx 2000). By contrast,
Adaptive Silicon's focus is to provide reconfigurable logic cores to
customers for embedding in their own system-on-a-chip (SoC) devices
(Adaptive 2001).

3.5. Routing Resources 

Interconnect resources are provided in a reconfigurable architecture
to connect together the device's programmable logic elements. These 
resources are usually configurable, where the path of a signal is 
determined at compile or run-time rather than fabrication time. This 
flexible interconnect between logic blocks or computational elements allows 
for a wide variety of circuit structures, each with their own interconnect 
requirements, to be mapped to the reconfigurable hardware. For example, the
routing for FPGAs is generally island-style, with logic surrounded by 
routing channels, which contain several wires, potentially of varying 
lengths. Within this type of routing architecture, however, there are still 
variations. Some of these differences include the ratio of wires 
to logic in the system, how long each of the wires should be, and whether 
they should be connected in a segmented or hierarchical manner. 

A step in the design of efficient routing structures for FPGAs and 
reconfigurable systems therefore involves examining the logic vs. routing
area trade-off within reconfigurable architectures. One group has argued
that the interconnect should constitute a much higher proportion of area in
order to allow for successful routing under high-logic utilization
conditions (Takahara et al. 1998). However, for FPGAs, high-LUT utilization
may not necessarily be the most desirable situation, but rather efficient
routing usage may be of more importance (DeHon 1999). This is because the
routing resources occupy a much larger part of the area of an FPGA than the
logic resources, and therefore the most area efficient designs will be
those that optimize their use of the routing resources rather than the
logic resources. The amount of required routing does not grow linearly with
the amount of logic present; therefore, larger devices require even greater
amounts of routing per logic block than small ones (Trimberger et al.
1997b).

There are two primary methods to provide both local and global 
routing resources, as shown in Figure 8. The first is the use of segmented 
routing (Betz and Rose 1999; Chow et al. 1999a). In segmented routing,
short wires accommodate local communications traffic. These short wires can 
be connected together using switchboxes to emulate longer wires. 
Frequently, segmented routing structures also contain longer wires to allow 
signals to travel efficiently over long distances without passing through a 
great number of switches. Hierarchical routing (Aggarwal and Lewis 1994; 
Lai and Wang 1997; Tsu et al. 1999) is the second method to provide both
local and global communication. Routing within a group (or cluster) of 
logic blocks is at the local level, only connecting within that cluster. At 
the boundaries of these clusters, however, longer wires connect the 
different clusters together. This is potentially repeated at a number of 
levels. The idea behind the use of hierarchical structures is that, 
provided a good placement has been made onto the hardware, most 
communication should be local and only a limited amount of communication 
will traverse long distances. Therefore, the wiring is designed to fit this 
model, with a greater number of local routing wires in a cluster than 
distance routing wires between clusters. 
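
The cost of stitching short wires together in a segmented routing structure can be estimated with a toy model. The sketch below assumes a straight connection spanning a given number of logic blocks and a set of available wire lengths. It covers the span greedily and reports how many switchboxes the signal crosses and how much wire length goes unused. The function route_straight and its greedy policy are assumptions made for illustration, not a description of any particular router.

```python
def route_straight(distance, segment_lengths):
    """Greedily cover `distance` logic-block spans with available wire lengths.

    Returns (segments_used, switches_crossed, wasted_length).
    """
    lengths = sorted(segment_lengths, reverse=True)
    remaining = distance
    used = []
    while remaining > 0:
        # Prefer the longest wire that does not overshoot the target;
        # if every wire overshoots, fall back to the shortest one.
        fitting = [length for length in lengths if length <= remaining]
        choice = fitting[0] if fitting else lengths[-1]
        used.append(choice)
        remaining -= choice
    wasted = -remaining                # wire length beyond the destination
    return used, len(used) - 1, wasted


only_short = route_straight(9, [1])     # short wires only: many switches
mixed = route_straight(9, [1, 4])       # short and long wires: few switches
only_long = route_straight(3, [4])      # long wire only: unused wire length
```

With only length-1 wires, a 9-block connection crosses 8 switchboxes. Adding length-4 wires reduces that to 2, while a 3-block connection forced onto a length-4 wire leaves one block of wire unused.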

(FIGURE 8 OMITTED) 

Because routing can occupy a large part of the area of a 
reconfigurable device, the type of routing used must be carefully
considered. If the wires available are much longer than what is required to 
route a signal, the excess wire length is wasted. On the other hand, if the 
wires available are much shorter than necessary, the signal must pass 
through switchboxes that connect the short wires together into a longer 
wire, or through levels of the routing hierarchy. This induces additional 
delay and slows the overall operation of the circuit. Furthermore, the 
switchbox circuitry occupies area that might be better used for additional
logic.

There are a few alternatives to the island-style of routing 
resources. Systems such as RaPiD (Ebeling et al. 1996) use segmented
bus-based routing, where signals are full word-sized in width. This is most 
common in the one-dimensional type of architecture, as discussed in the 
next section. 

3.6. One-Dimensional Structures 

Most current FPGAs are of the two-dimensional variety, as shown in 
Figure 9. This allows for a great deal of flexibility, as any signal can be 
routed on a nearly arbitrary path. However, providing this level of routing 
flexibility requires a great deal of routing area. It also complicates the 
placement and routing software, as the software must consider a very large 
number of possibilities. 

(FIGURE 9 OMITTED) 

One solution is to use a more one-dimensional style of architecture, 
also depicted in Figure 9. Here, placement is restricted along one axis. 
With a more limited set of choices, the placement can be performed much 
more quickly. Routing is also simplified, because it is generally along a 
single dimension as well, with the other dimension generally only used for 
calculations requiring a shift operation. One drawback of the 
one-dimensional routing is that if there are not enough routing resources 
in a particular area of a mapped circuit, routing that circuit becomes 
actually more difficult than on a two-dimensional array that provides more 
alternatives. A number of different reconfigurable systems have been
designed in this manner. Both Garp (Hauser and Wawrzynek 1997) and Chimaera
(Hauck et al. 1997) are structures that provide cells that compute a small
number of bit positions, and a row of these cells together computes the
full data word. A row can only be used by a single configuration, making
these designs one dimensional. In this manner, each configuration occupies
some number of complete rows. Although multiple narrow-width computations
can fit within a single row, these structures are optimized for word-based
computations that occupy the entire row. The NAPA architecture (Rupp et al.
1998) is similar, with a full column of cells acting as the atomic unit for
a configuration, as is PipeRench (Cadambi et al. 1998; Goldstein et al.
2000).

In some systems, the computation blocks in a one-dimensional 
structure operate on word-width values instead of single bits. Therefore, 
busses are routed instead of individual values. This also decreases the 
time required for routing, as the bits of a bus can be considered together 
rather than as separate routes. As shown previously in Figure 7, RaPiD 
(Ebeling et al. 1996) is basically a one-dimensional design that only
includes word-width processing elements. The different computation units
are organized in a single dimension along the horizontal axis. The general
flow of information follows this layout, with the major routing busses also
laid out in a horizontal manner. Additionally, all routing is of word-sized
values, and therefore all routing is of busses, not individual wires. A few
vertical resources are included in the architecture to allow signals to
transfer between busses, or to travel from a bus to a computation
node. However, the majority of the routing in this
architecture is one-dimensional.

3.7. Multi-FPGA Systems 

Reconfigurable systems that are composed of multiple FPGA chips
interconnected on a single processing board have additional hardware 
concerns over single-chip systems. In particular, there is a need for an 
efficient connection scheme between the chips, as well as to external 
memory and the system bus. This is to provide for circuits that are too 
large to fit within a single FPGA, but may be partitioned over the multiple 
FPGAs available. A number of different interconnection schemes have been 
explored (Butts and Batcheller 1991; Hauck et al. 1998a; Hauck 1998; Khalid
1999) including meshes and crossbars, as shown in Figure 10. A mesh
connects the nearest-neighbors in the array of FPGA chips. This allows for
efficient communication between the neighbors, but may require that some
signals pass through an FPGA simply to create a connection between
non-neighbors. Although this is possible, it uses
valuable I/O resources on the FPGA that forms the routing bridge. One
system that uses a mesh topology with additional board-level column and row
busses is the P1 system developed within the PAM project (Vuillemin et al.
1996). This architecture uses a central array of 16 commercial FPGAs with
connections to nearest-neighbors. However, four 16-bit row busses and four 
16-bit column busses run the length of the array and facilitate 
communication between non-neighbor FPGAs. 

(FIGURE 10 OMITTED) 

A crossbar attempts to remove this problem by using special 
routing-only chips to connect each FPGA potentially to any other FPGA. The 
inter-chip delays are more uniform, given that a signal travels the exact 
same "distance" to get from one FPGA to another, regardless of where those 
FPGAs are located. However, a crossbar interconnect does not scale easily 
with an increase in the number of FPGAs. The crossbar pattern of the chips 
is fixed at fabrication of the multi-FPGA board. Variants on these two 
basic topologies attempt to remove some of the problems encountered in mesh 
and crossbar topologies (Arnold et al. 1992; Varghese et al. 1993; Buell et
al. 1996; Vuillemin et al. 1996; Lewis et al. 1997; Khalid and Rose 1998).
One of these variants can be found in the Splash 2 system (Arnold et al.
1992; Buell et al. 1996). The predecessor, Splash 1, used a linear systolic
communication method. This type of connection was found to work quite well 
for a variety of applications. However, this highly constrained 
communication model made some types of computations difficult or even 
impossible. Therefore, Splash 2 was designed to include not only the linear 
connections of Splash 1 that were found to be useful for many applications, 
but also a crossbar network to allow any FPGA to communicate with any other 
FPGA on the same board. For multi-FPGA systems, because of the need for
efficient communication between the FPGAs, determining the inter-chip
routing topology is a very important step in the design process. More
details on multi-FPGA system architectures can be found elsewhere (Hauck
1998b; Khalid 1999).
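
The difference between the two basic topologies can be made concrete with a short calculation. This sketch assumes an idealized nearest-neighbor mesh with shortest-path routing and an idealized crossbar in which any two FPGAs are joined through a single routing-only chip. The 4 x 4 arrangement and the function names are assumptions chosen for illustration.

```python
from itertools import product


def mesh_intermediate_fpgas(src, dst):
    """FPGAs a signal must pass through on a shortest nearest-neighbor path."""
    (r1, c1), (r2, c2) = src, dst
    links = abs(r1 - r2) + abs(c1 - c2)   # chip-to-chip links traversed
    return max(links - 1, 0)


def crossbar_intermediate_chips(src, dst):
    """In an idealized crossbar, every pair is joined via one routing chip."""
    return 0 if src == dst else 1


# Worst case over all pairs in a hypothetical 4 x 4 array of FPGAs.
positions = list(product(range(4), repeat=2))
worst_mesh = max(mesh_intermediate_fpgas(a, b)
                 for a in positions for b in positions)
worst_crossbar = max(crossbar_intermediate_chips(a, b)
                     for a in positions for b in positions)
```

In the mesh, each FPGA that serves as a routing bridge spends I/O pins on signals it does not use, and the count grows with the distance between the endpoints. In the crossbar, the count is the same for every pair, which is the uniform delay noted above.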

3.8. Hardware Summary 

The design of reconfigurable hardware varies wildly from system to
system. The reconfigurable logic may be used as a configurable functional
unit, or may be a multi-FPGA stand-alone unit. Within the reconfigurable
logic itself, the complexity of the core computational units, or logic
blocks, varies from very simple to extremely complex, some implementing a
4-bit ALU or even a 16 x 16 multiplication. These blocks are not required
to be uniform throughout the array, as the use of different types of blocks 
can add high-performance functionality in the case of specialized 
computation circuitry, or expanded storage in the case of embedded memory 
blocks. Routing resources also offer a variety of choices, primarily in 
amount, length, and organization of the wires. Systems have been developed 
that fit into many different points within this design space, and no true 
"best" system has yet been agreed upon. 

4. SOFTWARE

Although reconfigurable hardware has been shown to have significant
performance benefits for some applications, it may be ignored by
application programmers unless they are able to easily incorporate its use
into their systems. This requires a software design environment that aids
in the creation of configurations for the reconfigurable hardware. This
software can range from a software assist in manual circuit creation to a
complete automated circuit design system. Manual circuit description is a
powerful method for the creation of high-quality circuit designs. However,
it requires a great deal of background knowledge of the particular
reconfigurable system employed, as well as a significant amount of design
time. On the other end of the spectrum, an automatic compilation system
provides a quick and easy way to program for reconfigurable systems. It
therefore makes the use of reconfigurable hardware more accessible to
general application programmers, but quality may suffer. 

Both for manual and automatic circuit creation, the design process 
proceeds through a number of distinct phases, as indicated in Figure 11. 
Circuit specification is the process of describing the functions that are 
to be placed on the reconfigurable hardware. This can be done as simply as
by writing a program in C that represents the functionality of the 
algorithm to be implemented in hardware. On the other hand, this can also 
be as complex as specifying the inputs, outputs, and operation of each 
basic building block in the reconfigurable system. Between these two
methods is the specification of the circuit using generic complex 
components, such as adders and multipliers, which will be mapped to the 
actual hardware later in the design process. For descriptions in a 
high-level language (HLL), such as C/C++ or Java, or ones using complex
building blocks, this code must be compiled into a netlist of gate-level 
components. For the HLL implementations, this involves generating 
computational components to perform the arithmetic and logic operations 
within the program, and separate structures to handle the program control, 
such as loop iterations and branching operations. Given a structural 
description, either generated from an HLL or specified by the user, each
complex structure is replaced with a network of the basic gates that 
perform that function. 

(FIGURE 11 OMITTED) 

Once a detailed gate- or element-level description of the circuit has 
been created, these structures must be translated to the actual logic 
elements of the reconfigurable hardware. This stage is known as technology
mapping, and is dependent upon the exact target architecture. For a
LUT-based architecture, this stage partitions the circuit into a number of
small subfunctions, each of which can be mapped to a single LUT (Brown et
al. 1992a; Abouzeid et al. 1993; Sangiovanni-Vincentelli et al. 1993; Hwang
et al. 1994; Chang et al. 1996; Hauck and Agarwal 1996; Yi and Jhon 1996;
Chowdhary and Hayes 1997; Lin et al. 1997; Cong and Wu 1998; Pan and Lin
1998; Togawa et al. 1998; Cong et al. 1999). Some architectures, such as
the Xilinx 4000 series (Xilinx 1994), contain multiple LUTs per logic cell.
These LUTs can be used either separately to generate small functions, or
together to generate some wider-input functions (Inuani and Saul 1997; Cong
and Hwang 1998). By taking advantage of multiple LUTs and the internal
routing within a single logic cell, functions with more inputs than can be 
implemented using a single LUT can efficiently be mapped into the FPGA 
architecture. Figure 12 shows one example of a wide function mapped to a 
multi-LUT FPGA logic cell. 
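
One way to see how a function that is too wide for a single LUT can be mapped to a multi-LUT cell is Shannon expansion. The sketch below assumes a hypothetical cell with two 4-input LUTs whose outputs feed a third, 3-input LUT configured as a multiplexer. It is an illustration of the principle, not the mapping algorithm of any cited work, and the names build_lut, lookup, and map_five_input are invented for the example.

```python
from itertools import product


def build_lut(func, num_inputs):
    """Truth table for func, indexed by the input bits (bit 0 = first input)."""
    return [func(*[(i >> k) & 1 for k in range(num_inputs)]) & 1
            for i in range(2 ** num_inputs)]


def lookup(table, *bits):
    """Read a LUT using the input bits as the address."""
    return table[sum(b << k for k, b in enumerate(bits))]


def map_five_input(func):
    """Shannon-expand func(a, b, c, d, e) around e into two 4-LUTs and a mux."""
    lut_e0 = build_lut(lambda a, b, c, d: func(a, b, c, d, 0), 4)
    lut_e1 = build_lut(lambda a, b, c, d: func(a, b, c, d, 1), 4)
    mux = build_lut(lambda f0, f1, sel: f1 if sel else f0, 3)

    def cell(a, b, c, d, e):
        f0 = lookup(lut_e0, a, b, c, d)   # value of func when e = 0
        f1 = lookup(lut_e1, a, b, c, d)   # value of func when e = 1
        return lookup(mux, f0, f1, e)     # e selects between the two
    return cell


def equivalent(func, cell):
    """Exhaustively compare the mapped cell with the original function."""
    return all(func(*bits) & 1 == cell(*bits)
               for bits in product((0, 1), repeat=5))


def example(a, b, c, d, e):
    return ((a & b) ^ (c | d)) ^ (e & a)


mapped = map_five_input(example)
```

The exhaustive check over all 32 input combinations confirms that the three-LUT cell reproduces the original 5-input function.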

(FIGURE 12 OMITTED) 

For reconfigurable structures that include embedded memory blocks,
the mapping stage may also consider using these memories as logic units
when they are not being used for data storage. The memories act as very
large LUTs, where the number of inputs is equal to the number of address 
lines. In order to use these memories as logic, the mapping software must 
analyze how much of the memory blocks are actually used as storage in a 
given mapping. It must then determine which are available in order to 
implement logic, and what part or parts of the circuit are best mapped to 
the memory (Cong and Xu 1998; Wilton 1998). 

After the circuit has been mapped, the resulting blocks must be 
placed onto the reconfigurable hardware. Each of these blocks is assigned
to a specific location within the hardware, hopefully close to the other 
logic blocks with which it communicates. As FPGA capacities increase, the 
placement phase of circuit mapping becomes more and more time consuming. 
Floorplanning is a technique that can be used to alleviate some of this 
cost. A floorplanning algorithm first partitions the logic cells into 
clusters, where cells with a large amount of communication are grouped 
together. These clusters are then placed as units onto regions of the 
reconfigurable hardware. Once this global placement is complete, the actual
placement algorithm performs detailed placement of the individual logic
blocks within the boundaries assigned to the cluster (Sankar and Rose
1999).

The use of a floorplanning tool is particularly helpful for 
situations where the circuit structure being mapped is of a datapath type. 
Large computational components or macros that are found in datapath 
circuits are frequently composed of highly regular logic. These structures 
are placed as entire units, and their component cells are restricted to the 
floorplanned location (Shi and Bhatia 1997; Emmert and Bhatia 1999). This
encourages the placer to find a very regular placement of these logic 
cells, resulting in a higher performance layout of the circuit. Another 
technique for the mapping and placement of datapath elements is to perform 
both of these steps simultaneously (Callahan et al. 1998). This method also
exploits the regularity of the datapath elements to generate mappings and 
placements quickly and efficiently. 

Floorplanning is also important when dealing with hierarchically 
structured reconfigurable designs. In these architectures, the available
resources have been grouped by the logic or routing hierarchy of the 
hardware. Because performance is best when routing lengths are minimized, 
the cells to be placed should be grouped such that cells that require a 
great deal of communication or which are on a critical path are placed 
together within a logic cluster on the hardware (Krupnova et al. 1997;
Senouci et al. 1998).

After floorplanning, the individual logic blocks are placed into 
specific logic cells. One algorithm that is commonly used is the simulated 
annealing technique (Shahookar and Mazumder 1991; Betz and Rose 1997; 
Sankar and Rose 1999). This method takes an initial placement of the
system, which can be generated (pseudo-)randomly, and performs a series of
"moves" on that layout. A move is simply the changing of the location of a
single logic cell, or the exchanging of locations of two logic cells. These
moves are attempted one at a time using random target locations. If a move
improves the layout, then the layout is changed to reflect that move. If a
move is considered to be undesirable, then it is only accepted a small
percentage of the time. Accepting a few "bad" moves helps to avoid any
local minima in the placement space. Other algorithms exist that rely less
on random movements (Gehring and Ludwig 1996), although such an approach
searches a smaller area of the placement space for a solution, and
therefore may be unable to find a solution that meets performance
requirements if a design uses a high percentage of the reconfigurable
resources.
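
The move-and-accept loop described above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not specify: nets are two-terminal, the cost is total Manhattan wire length, and undesirable moves are accepted with the usual exp(-delta/temperature) probability while the temperature is lowered geometrically. The names wirelength and anneal and all parameter values are hypothetical.

```python
import math
import random


def wirelength(placement, nets):
    """Total Manhattan distance over all two-terminal nets."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal(blocks, nets, grid, seed=0, temperature=5.0,
           cooling=0.95, moves_per_temp=100):
    """Place blocks on a grid x grid array; returns (best, start_cost, best_cost)."""
    rng = random.Random(seed)
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(sites)
    placement = dict(zip(blocks, sites))          # pseudo-random start
    occupant = {site: None for site in sites}
    for block, site in placement.items():
        occupant[site] = block
    cost = start_cost = wirelength(placement, nets)
    best, best_cost = dict(placement), cost

    while temperature > 0.01:
        for _ in range(moves_per_temp):
            block, target = rng.choice(blocks), rng.choice(sites)
            source, other = placement[block], occupant[target]
            if target == source:
                continue
            # Apply the move (a swap if the target site is occupied).
            placement[block], occupant[target] = target, block
            occupant[source] = other
            if other is not None:
                placement[other] = source
            delta = wirelength(placement, nets) - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost += delta                      # accept the move
                if cost < best_cost:
                    best, best_cost = dict(placement), cost
            else:
                # Reject: restore the previous layout.
                placement[block], occupant[source] = source, block
                occupant[target] = other
                if other is not None:
                    placement[other] = target
        temperature *= cooling
    return best, start_cost, best_cost


blocks = list(range(9))
nets = [(i, i + 1) for i in range(8)]             # a simple chain of blocks
best, start_cost, best_cost = anneal(blocks, nets, grid=4)
```

Early on, the high temperature lets many cost-increasing moves through, which is how the search escapes local minima. As the temperature falls, the loop behaves more and more like pure greedy improvement.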

Finally, the different reconfigurable components comprising the
application circuit are connected during the routing stage. Particular
signals are assigned to specific portions of the routing resources of the
reconfigurable hardware. This can become difficult if the placement causes
many connected components to be placed far from one another, as the signals 
that travel long distances use more routing resources than those that 
travel shorter ones. A good placement is therefore essential to the routing 
process. One of the challenges in routing for FPGAs and reconfigurable
systems is that the available routing resources are limited. In general
hardware design, the goal is to minimize the number of routing tracks used
in a channel between rows of computation units, but the channels can be
made as wide as necessary. In reconfigurable systems, however, the number
of available routing tracks is determined at fabrication time, and 
therefore the routing software must perform within these boundaries. Thus, 
FPGA routing concentrates on minimizing congestion within the available 
tracks (Brown et al. 1992b; McMurchie and Ebeling 1995; Alexander and
Robins 1996; Chan and Schlag 1997; Lee and Wu 1997; Thakur et al. 1997; Wu
and Marek-Sadowska 1997; Swartz et al. 1998; Nam et al. 1999). Because
routing is one of the more time-intensive portions of the design cycle, it
can be helpful to determine if a placed circuit can be routed before
actually performing the routing step. This quickly informs the designer if
changes need to be made to the layout or a larger reconfigurable structure
is required (Wood and Rutenbar 1997; Swartz et al. 1998).
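
A very rough routability check can be written in a few lines. The sketch below is a crude stand-in for the estimation techniques cited above, not a reproduction of them. It assumes two-terminal nets, routes each one as a single L shape (horizontal leg first, then vertical), counts how many nets use each channel segment, and reports the segments whose demand exceeds a fixed number of tracks. The function name and data layout are hypothetical.

```python
from collections import Counter


def estimate_congestion(placement, nets, tracks_per_channel):
    """Count wires per channel segment for L-shaped routes; report overflow."""
    usage = Counter()
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        step = 1 if xb >= xa else -1
        for x in range(xa, xb, step):              # horizontal leg in row ya
            usage[("h", min(x, x + step), ya)] += 1
        step = 1 if yb >= ya else -1
        for y in range(ya, yb, step):              # vertical leg in column xb
            usage[("v", xb, min(y, y + step))] += 1
    overflow = {segment: count - tracks_per_channel
                for segment, count in usage.items()
                if count > tracks_per_channel}
    return usage, overflow


# Four blocks in one row; the two nets overlap between x = 1 and x = 2.
placement = {0: (0, 0), 1: (2, 0), 2: (1, 0), 3: (3, 0)}
nets = [(0, 1), (2, 3)]
usage, overflow_one_track = estimate_congestion(placement, nets, 1)
_, overflow_two_tracks = estimate_congestion(placement, nets, 2)
```

An empty overflow result does not guarantee that a real router will succeed, and a non-empty one does not prove failure, since a router may detour around congested channels. The estimate only flags placements that are likely to be troublesome.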

Each of the design phases mentioned above may be implemented either 
manually or automatically using compiler tools. The operation of some of 
these individual steps is described in greater depth in the following
sections.

4.1. Hardware-Software Partitioning 

For systems that include both reconfigurable hardware and a
traditional microprocessor, the program must first be partitioned into
sections to be executed on the reconfigurable hardware and
sections to be executed in software on the microprocessor. In 
general, complex control sequences such as variable-length loops are more 
efficiently implemented in software, while fixed datapath operations may be 
more efficiently executed in hardware. 

Most compilers presented for reconfigurable systems generate only the
hardware configuration for the system, rather than both hardware and
software. In some cases, this is because the reconfigurable hardware may
not be coupled with a host processor, so only a hardware configuration is
necessary. For cases where reconfigurable hardware does operate alongside a
host microprocessor, some systems currently require that the hardware
compilation be performed separately from the software compilation, and
special functions are called from within the software in order to configure
and control the reconfigurable hardware. However, this requires effort on
the part of the designer to identify the sections that should be
mapped to hardware, and to translate these into special hardware functions.
In order to make the use of the reconfigurable hardware transparent to the
designer, the partitioning and programming of the hardware should occur 
simultaneously in a single programming environment. 

For compilers that manage both the hardware and software aspects of 
application design, the hardware/software partitioning can be performed
either manually, or automatically by the compiler itself. When the 
partitioning is performed by the programmer, compiler directives are used 
to mark sections of program code for hardware compilation. The 
NAPA C language (Gokhale and Stone 1998) provides pragma statements to
allow a programmer to specify whether a section of code is to be
executed in software on the Fixed Instruction Processor (FIP), or in
hardware on the Adaptive Logic Processor (ALP). Cardoso and Neto (1999)
present another compiler that requires the user to specify (using 
information gained through the use of profiling tools) which areas of code 
to map to the reconfigurable hardware.

Alternately, the hardware/software partitioning can be done 
automatically (Chichkov and Almeida 1997; Kress et al. 1997; Callahan et
al. 2000; Li et al. 2000a). In this case, the compiler will use cost
functions based upon the amount of acceleration gained through the 
execution of a code fragment in hardware to determine whether the cost of 
configuration is overcome by the benefits of hardware execution. 
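
The kind of cost function described above can be sketched with a simple break-even test. This is a minimal model under stated assumptions, not the cost function of any cited compiler: each code fragment has an execution count (for example, from profiling), a per-execution time in software and in hardware, and a configuration cost that is paid once. The class Fragment, its fields, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    name: str
    executions: int        # how often the fragment runs
    software_time: float   # time per execution on the host processor
    hardware_time: float   # time per execution on the reconfigurable hardware
    config_time: float     # one-time cost of loading the configuration


def net_gain(fragment):
    """Time saved by hardware execution, minus the cost of configuration."""
    saved = fragment.executions * (fragment.software_time -
                                   fragment.hardware_time)
    return saved - fragment.config_time


def partition(fragments):
    """Send a fragment to hardware only if its net gain is positive."""
    hardware = [f.name for f in fragments if net_gain(f) > 0]
    software = [f.name for f in fragments if net_gain(f) <= 0]
    return hardware, software


fragments = [
    Fragment("filter_loop", 10000, 2.0, 0.25, 5000.0),  # hot inner loop
    Fragment("setup", 1, 50.0, 5.0, 5000.0),            # runs only once
]
hardware, software = partition(fragments)
```

The frequently executed loop repays its configuration cost many times over, while the one-time setup code does not, even though it also runs faster in hardware on a per-execution basis.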

4.2. Circuit Specification 

In order to use the reconfigurable hardware, designers must somehow
be able to specify the operation of their custom circuits. Before
high-level compilation tools are developed for a specific reconfigurable
system, this is done through hand mapping of the circuit, where the
designer specifies the operation of the components in the configurable
system directly. Here, the designers utilize the basic building blocks of
the reconfigurable system to create the desired circuit. This style of
circuit specification is primarily useful only when a software front-end 
for circuit design is unavailable, or for the design of small circuits or 
circuits with very high performance requirements. This is due to the great 
amount of time involved in manual circuit creation. However, for circuits 
that can be reasonably hand mapped, this provides potentially the smallest 
and fastest implementation. 

Because not all designers can be intimately familiar with every 
reconfigurable architecture, some design tools abstract the specifics of 
the target architecture. Creating a circuit using a structural design 
language involves describing a circuit using building blocks such as gates, 
flip-flops and latches (Bellows and Hutchings 1998; Gehring and Ludwig 
1998; Hutchings et al. 1999). The compiler then maps these modules to one 
or more basic components of the architecture of the reconfigurable system. 
Structural VHDL is one example of this type of programming, and commercial 
tools are available for compiling from this language into vendor-specific 
FPGAs. 

However, these two methods require that the designer possess either 
an intimate knowledge of the targeted reconfigurable hardware, or at least 
a working knowledge of the concepts involved in hardware design. In order 
to allow a greater number of software developers to take advantage of 
reconfigurable computing, tools that allow for behavioral circuit 
descriptions are being developed. These systems trade some area and 
performance quality for greater flexibility and ease of use. 

Behavioral circuit design is similar to software design because the 
designer indicates the steps a hardware subsystem must go through in order 
to perform the desired computation rather than the actual composition of 
the circuit. These behavioral descriptions can be either in a generic 
hardware description language such as VHDL or Verilog, or a 
general-purpose high-level language such as C/C++ or Java. The eventual 
goal of this type of compilation is to allow users to write programs in 
commonly used languages that compile equally well, without modification, to 
both a traditional software executable and to an executable which leverages 
reconfigurable hardware. 

Working in this direction, Transmogrifier C (Galloway 1995) 
allows a subset of the C language to be used to describe hardware circuits. 
While multiplication, division, pointers, arrays, and a few other C 
language specifics are not supported, this system provides a behavioral 
method of circuit description using a primitive form of the C language. 
Similarly, the C++ programming environment used for the P1 system 
(Vuillemin et al. 1996) provides a hybrid method of description, using a 
combination of behavioral and structural design. Synopsys' CoCentric 
compiler (Synopsys 2000), which can be targeted to the Xilinx Virtex series 
of FPGAs, uses SystemC to provide for behavioral compilation of C/C++ with 
the assistance of a set of additional hardware-defining classes. Other 
compilers, such as Nimble (Li et al. 2000a) and the Garp compiler (Callahan 
et al. 2000), are fully behavioral C compilers, handling the full set of 
the ANSI C language. 

Although behavioral description, and HLL description in particular, 
provides a convenient method for the programming of reconfigurable systems, 
it does suffer from the drawback that it tends to produce larger and slower 
designs than those generated by a structural description or hand-mapping. 
Behavioral descriptions can leave many aspects of the circuit unspecified. 
For example, a compiler that encounters a while loop must generate 
complicated control structures in order to allow for an unspecified number 
of iterations. Also, in many HLL implementations, optimizations based upon 
the bit width of operands cannot be performed. The compiler is generally 
unaware of any application-specific limitations on the operand size; it 
only sees the programmer's choice of data format in the program. Problems 
such as these might be solved through additional programmer effort to 
replace while loops whenever possible with for loops, and to use compiler 
directives to indicate exact sizes of operands (Galloway 1995; Gokhale and 
Stone 1998). This method of hardware design falls between structural 
description and behavioral description in complexity, because although the 
programmers do not need to know a great deal about hardware design, they 
are required to follow additional guidelines that are not required for 
software-only implementations. 

4.3. Circuit Libraries 

The use of circuit or macro libraries can greatly simplify and speed 
the design process. By predesigning commonly used structures such as 
adders, multipliers, and counters, circuit creation for configurable 
systems becomes largely the assembly of high-level components, and only 
application-specific structures require detailed design. The actual 
architecture of the reconfigurable device can be abstracted, provided only 
library components are used, as these low-level details will already have 
been encapsulated within the library structures. Although the users of the 
circuit library may not know the intricacies of the destination 
architecture, they are still able to make use of architecture-specific 
optimizations, such as specialized carry chains. This is because designers 
very familiar with the details of the target architecture create the 
components within a circuit library. They can take advantage of 
architecture specifics when creating the modules to make these components 
faster and smaller than a designer unfamiliar with the architecture likely 
would. An added benefit of the architecture abstraction is that the use of 
library components can also facilitate design migration from one 
architecture to another, because designers are not required to learn a new 
architecture, but only to indicate the new target for the library 
components. However, this does require that a circuit library contain 
implementations for more than one architecture. 

One method for using library components is to simply instantiate them 
within an HDL design (Xilinx 1997; Altera 1999). However, circuit libraries 
can also be used in general language compilers by comparing the dataflow 
graph of the application to the dataflow graphs of the library macros 
(Cadambi and Goldstein 1999). If a dataflow representation of a macro 
matches a portion of the application graph, the corresponding macro is used 
for that part of the configuration. 
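
The matching idea can be sketched in Python as follows. For brevity the 
sketch matches expression trees rather than general dataflow graphs, and it 
is not the algorithm of Cadambi and Goldstein (1999). The tuple encoding, 
the "?" wildcard convention, and the "mac" macro are assumptions made 
for illustration only. 

```python
def match(pattern, node, bindings=None):
    """Match a macro pattern against a dataflow subtree.

    Pattern leaves starting with '?' are wildcards that bind to any
    subtree. Returns a dict of bindings, or None if there is no match.
    """
    bindings = dict(bindings or {})
    if isinstance(pattern, str):
        if pattern.startswith("?"):
            if pattern in bindings and bindings[pattern] != node:
                return None
            bindings[pattern] = node
            return bindings
        return bindings if pattern == node else None
    if (not isinstance(node, tuple) or len(node) != len(pattern)
            or node[0] != pattern[0]):
        return None
    for p_child, n_child in zip(pattern[1:], node[1:]):
        bindings = match(p_child, n_child, bindings)
        if bindings is None:
            return None
    return bindings


def cover(node, macros):
    """Rewrite a tree bottom-up, replacing matched subtrees with macros."""
    if isinstance(node, tuple):
        node = (node[0],) + tuple(cover(c, macros) for c in node[1:])
    for name, pattern in macros.items():
        found = match(pattern, node)
        if found is not None:
            return ("macro", name) + tuple(found[k] for k in sorted(found))
    return node


# Hypothetical library with one multiply-accumulate macro: a * b + c.
macros = {"mac": ("add", ("mul", "?a", "?b"), "?c")}
# Application dataflow: (x * y + z) - w
app = ("sub", ("add", ("mul", "x", "y"), "z"), "w")
print(cover(app, macros))
```

The multiply-add portion of the application is replaced by the library 
macro, and the remaining subtraction is left for ordinary compilation. 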

Another benefit of circuit design with library macros is that of fast 
compilation. Because the library structures may have been premapped, 
preplaced, and prerouted (at least within the macro boundaries), the actual 
compile time is reduced to the time required to place the library 
components and route between them. For example, fast configuration was one 
of the main motivations for the creation of libraries for circuit design in 
the DISC reconfigurable image processing system (Hutchings 1997). 

4.4. Circuit Generators 

Circuit generators fulfill a role similar to circuit libraries, in 
that they provide optimized high-level structures for use within larger 
applications. Again, designers are not required to understand the low-level 
details of particular architectures. However, circuit generators create 
semicustomized high-level structures automatically at compile time, as 
opposed to circuit libraries, which only provide static structures. For 
example, a circuit generator can create an adder structure of the exact bit 
width required by the designer, whereas a circuit library is likely to 
contain a limited number of adder structures, none of which may be of the 
correct size. Circuit generators are therefore more flexible than circuit 
libraries because of the customization allowed. 
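
To make the distinction concrete, the following Python sketch shows a toy 
circuit generator that emits a ripple-carry adder netlist of any requested 
bit width, together with a small simulator to check it. The netlist format 
and the function names are assumptions for illustration and do not 
correspond to any of the generators cited in this section. 

```python
OPS = {
    "xor": lambda x, y: x ^ y,
    "and": lambda x, y: x & y,
    "or": lambda x, y: x | y,
}


def generate_adder(width):
    """Return a gate list for a ripple-carry adder of the given width.

    Each gate is (output, op, input1, input2). Inputs are a0.., b0.. and
    the carry-in c0; outputs are s0..s{width-1} and carry-out c{width}.
    """
    gates = []
    for i in range(width):
        gates += [
            (f"p{i}", "xor", f"a{i}", f"b{i}"),   # propagate
            (f"s{i}", "xor", f"p{i}", f"c{i}"),   # sum bit
            (f"g{i}", "and", f"a{i}", f"b{i}"),   # generate
            (f"t{i}", "and", f"p{i}", f"c{i}"),   # propagated carry
            (f"c{i + 1}", "or", f"g{i}", f"t{i}"),  # carry out of bit i
        ]
    return gates


def simulate(gates, width, a, b):
    """Evaluate the netlist on integer operands a and b."""
    values = {"c0": 0}
    for i in range(width):
        values[f"a{i}"] = (a >> i) & 1
        values[f"b{i}"] = (b >> i) & 1
    for out, op, x, y in gates:  # gates are emitted in dependency order
        values[out] = OPS[op](values[x], values[y])
    total = sum(values[f"s{i}"] << i for i in range(width))
    return total | (values[f"c{width}"] << width)


adder5 = generate_adder(5)  # exactly 5 bits wide, no wasted logic
print(len(adder5), simulate(adder5, 5, 19, 27))
```

A static library, by contrast, would have to ship a fixed set of widths, 
and a 5-bit request might be served by an 8-bit macro with unused logic. 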

Some circuit generators, such as MacGen (Yasar et al. 1996), are 
executed at the command line using custom description files to generate 
physical design layout data files. Newer circuit generators, however, are 
functions or methods called from high-level language programs. PAM-Blox 
(Mencer et al. 1998), for example, is a set of circuit generators executed 
in C++ that generate structures for use with the PCI Pamette reconfigurable 
processing board. The circuit generator presented by Chu et al. (1998) 
contains a number of Java classes to allow a programmer to generate 
arbitrarily sized arithmetic and logical components for a circuit. Although 
the examples presented in that paper were mapped to a Xilinx 4000 series 
FPGA, the generator uses architecture-specific libraries for module 
generation. The target architecture can therefore be changed through the 
use of a different design library. The Carry Look-Ahead circuit generator 
described by Stohmann and Barke (1996) is also retargetable, because it 
maps to an FPGA logic cell architecture defined by the user. 

One drawback of the circuit generators is that they depend on a 
regular logic and routing structure. Hierarchical routing structures (such 
as those present in the Xilinx 6200 series (Xilinx 1996)) and specialized 
heterogeneous logic blocks are frequently not accounted for. Therefore, 
some optimized features of a particular architecture may be unused. For 
these cases, a circuit macro from a library may provide a more highly 
optimized structure than one created with a circuit generator, provided 
that the library macro fits the needs of the application. 

4.5. Partial Evaluation 

Functions that are to be implemented on the reconfigurable array 
should occupy as little area as possible, so as to maximize the number of 
functions that can be mapped to the hardware. This, combined with the 
minimization of the delay incurred by each circuit, increases the overall 
acceleration of the application. Partial evaluation is the process of 
reducing hardware requirements for a circuit structure through optimization 
based upon known static inputs. Specifically, if an input is known to be 
constant, that value can potentially be propagated through one or more 
gates in the structure at compile time, and only the portions of a circuit 
that depend on time-varying inputs need to be mapped to the reconfigurable 
structure. 

One example of the usefulness of this operation is that of constant 
coefficient multipliers. If one input to a multiplier is constant, a 
multiplier object can be reduced from a general-purpose multiplier to a set 
of additions with static-length shifts between them corresponding to the 
locations of 1s in the binary constant. This type of reduction leads to a 
lower area requirement for the circuit, and potentially higher performance 
due to fewer gate delays encountered on the critical path. Partial 
evaluation can also be performed in conjunction with circuit generation, 
where the constants passed to the generator function are used to simplify 
the created hardware circuit (Wang and Lewis 1997; Chu et al. 1998). Other 
examples of this type of optimization for specific algorithms include the 
partial evaluation of DES encryption circuits (Leonard and Mangione-Smith 
1997), and the partial evaluation of constant multipliers and fixed 
polynomial division circuits (Payne 1997). 
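
The constant-coefficient multiplier reduction can be illustrated with a 
short Python sketch. It models the arithmetic only (one shifted copy of 
the input per 1 bit in the constant), not the resulting gates, and the 
function names are chosen here for illustration. 

```python
def shift_positions(constant):
    """Bit positions of the 1s in a non-negative integer constant."""
    return [i for i in range(constant.bit_length()) if (constant >> i) & 1]


def multiply_by_constant(x, constant):
    """Multiply using only fixed shifts and additions."""
    return sum(x << s for s in shift_positions(constant))


# 10 is binary 1010, so x * 10 becomes (x << 1) + (x << 3):
# two shifted terms and a single addition instead of a full multiplier.
print(shift_positions(10), multiply_by_constant(7, 10))
```

In hardware the fixed shifts are simply wiring, so the cost of the reduced 
multiplier is dominated by the adders, one fewer than the number of 1 bits. 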

4.6. Memory Allocation 

As with traditional software programs, it may be necessary in 
reconfigurable computing to allocate memories to hold 
variables and other data. Off-chip memories may be added to the 
reconfigurable system. Alternately, if a reconfigurable system includes 
memory blocks embedded into the reconfigurable logic, these may be used, 
provided that the storage requirements do not surpass the available 
embedded memory. If multiple off-chip memories are available to a 
reconfigurable system, variables used in parallel should be placed into 
different memory structures, such that they can be accessed simultaneously 
(Gokhale and Stone 1999). When smaller embedded memory units are used, 
larger memories can be created from the smaller ones. However, in this 
case, it is desirable to ensure that each smaller memory is close to the 
computation that most requires its contents (Babb et al. 1999). As 
mentioned earlier, the small embedded memories that are not allocated for 
data storage may be used to perform logic functions. 
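
Placing variables that are used in parallel into different memories can be 
viewed as a graph coloring problem. The following Python sketch uses a 
simple greedy heuristic. It is an assumption made for illustration, not the 
method of Gokhale and Stone (1999), and the variable names and bank count 
are hypothetical. 

```python
def assign_banks(variables, parallel_pairs, num_banks):
    """Greedily assign variables to memory banks.

    Variables listed together in parallel_pairs are accessed in the same
    cycle and should land in different banks where possible.
    """
    conflicts = {v: set() for v in variables}
    for u, v in parallel_pairs:
        conflicts[u].add(v)
        conflicts[v].add(u)

    assignment = {}
    # Place the most constrained variables first.
    for v in sorted(variables, key=lambda name: -len(conflicts[name])):
        used = {assignment[n] for n in conflicts[v] if n in assignment}
        free = [b for b in range(num_banks) if b not in used]
        # With no conflict-free bank left, fall back to bank 0; those
        # accesses would then have to be serialized.
        assignment[v] = free[0] if free else 0
    return assignment


banks = assign_banks(["a", "b", "c", "d"], [("a", "b"), ("b", "c")], 2)
print(banks)
```

With two banks, b is separated from both a and c, so each parallel pair can 
be read in a single cycle. 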

4.7. Parallelization 

One of the benefits of reconfigurable computing is the ability to 
execute multiple operations in parallel. In cases where circuits are 
specified using a structural hardware description language, the user 
specifies all structures and timing, and therefore either implicitly or 
explicitly specifies any parallel operation. However, for behavioral and 
HLL descriptions, there are two methods to incorporate parallelism: manual 
parallelization through special instructions or compiler directives, and 
automatic parallelization by the compiler. 

To manually incorporate parallelism within an application, the 
programmer can specifically mark sections of code that should 
run as parallel threads, and use similar operations to those used in 
traditional parallel compilers (Cronquist et al. 1998; Gokhale and Stone 
1998). For example, a signal/wait technique can be used to perform 
synchronization of the different threads of the computation. The RaPiD-B 
language (Cronquist et al. 1998) is one that uses this methodology. 
Although the NAPA C compiler (Gokhale and Stone 1998) requires programmers 
to mark the areas of code in which the host processor and the 
reconfigurable hardware execute in parallel, it also detects and exploits 
fine-grained parallelism within computations destined for the 
reconfigurable hardware. 

Automatic parallelization of inner loops is another common technique 
in reconfigurable hardware compilers to attempt to maximize the use of the 
reconfigurable hardware. The compiler will select the innermost loop level 
to be completely unrolled for parallel execution in hardware, potentially 
creating a heavily pipelined structure (Cronquist et al. 1998; Weinhardt 
and Luk 1999). For these cases, outer loops may not have multiple 
iterations executing simultaneously. Any loop reordering to improve the 
parallelism of the circuit must be done by the programmer. However, some 
compiler systems have taken this procedure a step further and focus on the 
parallelization of all loops within the program, not just the inner loops 
(Wang and Lewis 1997; Budiu and Goldstein 1999). This type of compiler 
generates a control flow graph based upon the entire program source code. 
Loop unrolling is used in order to increase the available parallelism, and 
the graph is then used to schedule parallel operations in the hardware. 

4.8. Multi-FPGA System Software 

When reconfigurable systems use more than one FPGA to form the 
complete reconfigurable hardware, there are additional compilation issues 
to deal with (Hauck and Agarwal 1996). The design must first be partitioned 
into the different FPGA chips (Hauck 1995; Acock and Dimond 1997; Vahid 
1997; Brasen and Saucier 1998; Khalid 1999). This is generally done by 
placing each highly connected portion of a circuit into a single chip. 
Multi-FPGA systems have a limited number of I/O pins that connect the chips 
together, and therefore their use must be minimized in the overall circuit 
mapping. Also, by minimizing the amount of routing required between the 
FPGAs, the number of paths with a high (inter-chip) delay is reduced, and 
the circuit may have an overall higher performance. Similarly, those 
sections of the circuit that require a short delay time must be 
placed upon the same chip. Global placement then determines which of the 
actual FPGAs in the multi-FPGA system will contain each of the partitions. 

After the circuit has been partitioned into the different FPGA chips, 
the connections between the chips must be routed (Mak and Wong 1997; 
Ejnioui and Ranganathan 1999). A global routing algorithm determines at a 
high level the connections between the FPGA chips. It first selects a 
region of output pins on the source FPGA for a given signal, and determines 
which (if any) routing switches or additional FPGAs the signal must pass 
through to get to the destination FPGA. Detailed routing and pin assignment 
(Slimane-Kade et al. 1994; Hauck and Borriello 1997; Mak and Wong 1997; 
Ejnioui and Ranganathan 1999) are then used to assign signals to traces on 
an existing multi-FPGA board, or to create traces for a multi-FPGA board 
that is to be created specifically to implement the given circuit. 
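
The partitioning objective described in this section, keeping highly 
connected logic on one chip so that few signals must cross between chips, 
can be sketched in Python as follows. The sketch only evaluates a given 
partition. The net and block names are hypothetical, and the cited 
partitioners use far more elaborate cost models and search strategies. 

```python
def cut_nets(nets, chip_of):
    """Names of nets whose blocks are spread over more than one chip."""
    return [name for name, blocks in nets.items()
            if len({chip_of[b] for b in blocks}) > 1]


def pins_used(nets, chip_of):
    """I/O pins needed per chip: one per cut net that touches the chip."""
    pins = {}
    for name in cut_nets(nets, chip_of):
        for chip in {chip_of[b] for b in nets[name]}:
            pins[chip] = pins.get(chip, 0) + 1
    return pins


# Blocks A and B are tightly connected, as are C and D.
nets = {
    "n1": ["A", "B"], "n2": ["A", "B"],
    "n3": ["C", "D"], "n4": ["C", "D"],
    "n5": ["A", "C"],
}
good = {"A": 0, "B": 0, "C": 1, "D": 1}  # keeps connected blocks together
bad = {"A": 0, "C": 0, "B": 1, "D": 1}   # splits both connected pairs

print(len(cut_nets(nets, good)), len(cut_nets(nets, bad)))
```

The first partition needs one inter-chip signal, while the second needs 
four, each of which consumes a pin on both chips and adds inter-chip delay. 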

Because multi-FPGA systems use interchip connections to allow the 
circuit partitions to communicate, they frequently require a higher 
proportion of I/O resources vs. logic in each chip than is normally 
required in single-FPGA use. For this reason, some research has focused on 
methods to allow pins of the FPGAs to be reused for multiple signals. This 
procedure is referred to as Virtual Wires (Babb et al. 1993; Agarwal 1995; 
Selvidge et al. 1995), and allows for a flexible trade-off between logic 
and I/O within a given multi-FPGA system. Signals are multiplexed onto a 
single wire by using multiple virtual clock cycles, one per multiplexed 
signal, within a user clock cycle, thus pipelining the communication. In 
this manner, the I/O requirements of a circuit can be reduced, while the 
logic requirements (because of the added circuitry used for the 
multiplexing) are increased. 
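
The pin savings can be estimated with a short calculation, sketched here in 
Python. The slot assignment below ignores the dependencies between signals 
that a real Virtual Wires scheduler must respect, and the function names 
are assumptions for illustration. 

```python
import math


def pins_needed(num_signals, virtual_cycles):
    """Physical pins needed when each pin carries one signal per virtual
    clock cycle within a user clock cycle."""
    return math.ceil(num_signals / virtual_cycles)


def schedule(signals, virtual_cycles):
    """Map each signal to a (pin, virtual cycle) slot, filling pins in order."""
    return {s: (i // virtual_cycles, i % virtual_cycles)
            for i, s in enumerate(signals)}


print(pins_needed(10, 4))
print(schedule(["a", "b", "c", "d", "e"], 2))
```

Ten inter-chip signals multiplexed over four virtual cycles need only three 
pins instead of ten, at the cost of the multiplexing logic on each chip. 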

4.9. Design Testing 

After compilation, an application needs to be tested for correct 
operation before deployment. For hardware configurations that have been 
generated from behavioral descriptions, this is similar to the debugging of 
a software application. However, structurally and manually created circuits 
must be simulated and debugged with techniques based upon those from the 
design of general hardware circuits. For these structures, simulation and 
debugging are critical not only to ensure proper circuit operation, but 
also to prevent possible incorrect connections from causing a short within 
the circuit, which can damage the reconfigurable hardware. 

There are several different methods of observing the behavior of a 
configuration during simulation. The contents of memory structures within 
the design can be viewed, modified, or saved. This allows on-the-fly 
customization of the simulated execution environment of the reconfigurable 
hardware, as well as a method for examining the computation results. The 
input and output values of circuit structures and substructures can also be 
viewed either on a generated schematic drawing or with a traditional 
waveform output. By examining these values, the operation of the circuit 
can be verified for correctness, and conflicts on individual wires can be 
seen. A number of simulation and debugging software systems have been 
developed that use some or all of these techniques (Arnold et al. 1992; 
Buell et al. 1996; Gehring and Ludwig 1996; Lysaght and Stockwood 1996; 
Bellows and Hutchings 1998; Hutchings et al. 1999; McKay and Singh 1999; 
Vasilko and Cabanis 1999). 

4.10. Software Summary 

Reconfigurable hardware systems require software compilation tools to 
allow programmers to harness the benefits of reconfigurable computing. On 
one end of the spectrum, circuits for reconfigurable systems can be 
designed manually, leveraging all application-specific and 
architecture-specific optimizations available to generate a 
high-performance application. However, this requires a great deal of time 
and effort on the part of the designer. At the opposite end of the spectrum 
is fully automatic compilation of a high-level language. Using the 
automatic tools, a software programmer can transparently utilize the 
reconfigurable hardware without the need for direct intervention. The 
circuits created using this method, while quickly and easily created, are 
generally larger and slower than manually created versions. The actual 
tools available for compilation onto reconfigurable systems fall at various 
points within this range, where many are partially automated but require 
some amount of manual aid. Circuit designers for reconfigurable systems 
therefore face a trade-off between the ease of design and the quality of 
the final layout. 

5. RUN-TIME RECONFIGURATION 

Frequently, the areas of a program that can be accelerated through 
the use of reconfigurable hardware are too numerous or complex to be loaded 
simultaneously onto the available hardware. For these cases, it is 
beneficial to be able to swap different configurations in and out of the 
reconfigurable hardware as they are needed during program execution (Figure 
13). This concept is known as run-time reconfiguration (RTR). 

(FIGURE 13 OMITTED) 

Run-time reconfiguration is based upon the concept of virtual 
hardware, which is similar to virtual memory. Here, the physical hardware 
is much smaller than the sum of the resources required by each of the 
configurations. Therefore, instead of reducing the number of configurations 
that are mapped, we instead swap them in and out of the actual hardware as 
they are needed. Because run-time reconfiguration allows more 
sections of an application to be mapped into hardware than can 
be fit in a non-run-time reconfigurable system, a greater portion of the 
program can be accelerated. This provides potential for an overall 
improvement in performance. 

During a single program's execution, configurations are swapped in 
and out of the reconfigurable hardware. Some of these configurations will 
likely require access to the results of other configurations. 
Configurations that are active at different periods in time therefore must 
be provided with a method to communicate with one another. Primarily, this 
can be done through the use of registers (Ebeling et al. 1996; Cadambi et 
al. 1998; Rupp et al. 1998; Scalera and Vazquez 1998), the contents of 
which can remain intact between reconfigurations. This allows one 
configuration to store a value, and a later configuration to read back that 
value for use in further computations. An alternative for reconfigurable 
systems that do not include state-holding devices is to write the result 
back to registers or memory external to the reconfigurable array, which is 
then read back by successive configurations (Hauck et al. 1997). 

There are a few different configuration memory styles that can be 
used with reconfigurable systems. A single context device is a serially 
programmed chip that requires a complete reconfiguration in order to change 
any of the programming bits. A multicontext device has multiple layers of 
programming bits, each of which can be active at a different point in time. 
Devices that can be selectively programmed without a complete 
reconfiguration are called partially reconfigurable. These different types 
of configuration memory are described in more detail later. An advantage of 
the multicontext FPGA over a single context architecture is that it allows 
for an extremely fast context switch (on the order of nanoseconds), whereas 
the single context may take milliseconds or more to reprogram. The 
partially reconfigurable architecture is also more suited to run-time 
reconfiguration than the single context, because small areas of the array 
can be modified without requiring that the entire logic array be 
reprogrammed. 

For all of these run-time reconfigurable architectures, there are 
also a number of compilation issues that do not arise in systems 
that only configure at the beginning of an application. For example, 
run-time reconfigurable systems are able to optimize based on values that 
are known only at run-time. Furthermore, compilers must consider the 
run-time reconfigurability when generating the different circuit mappings, 
not only to be aware of the increase in time-multiplexed capacity, but also 
to schedule reconfigurations so as to minimize the overhead that they 
incur. These software issues, as well as an overview of methods to perform 
fast configuration, will be explored in the sections that 
follow. 

5.1. Reconfigurable Models 

Traditional FPGA structures have been single context, only allowing 
one full-chip configuration to be loaded at a time. However, designers of 
reconfigurable systems have found this style of configuration to be too 
limiting or slow to efficiently implement run-time reconfiguration. The 
following discussion defines the single context device, and further 
considers newer FPGA designs (multicontext and partially reconfigurable), 
along with their impact on run-time reconfiguration. 

5.1.1. Single Context. Current single context FPGAs are programmed 
using a serial stream of configuration information. Because only sequential 
access is supported, any change to a configuration on this type of FPGA 
requires a complete reprogramming of the entire chip. Although this does 
simplify the reconfiguration hardware, it does incur a high overhead when 
only a small part of the configuration memory needs to be changed. Many 
commercial FPGAs are of this style, including the Xilinx 4000 series 
(Xilinx 1994), the Altera Flex10K series (Altera 1998), and Lucent's Orca 
series (Lucent 1998). This type of FPGA is therefore more suited for 
applications that can benefit from reconfigurable computing without 
run-time reconfiguration. A single context FPGA is depicted in Figure 14. 

(FIGURE 14 OMITTED) 

In order to implement run-time reconfiguration onto a single context 
FPGA, the configurations must be grouped into contexts, and each full 
context is swapped in and out of the FPGA as needed. Because each of these 
swap operations involves reconfiguring the entire FPGA, a good partitioning 
of the configurations between contexts is essential in order to minimize 
the total reconfiguration delay. If all the configurations used within a 
certain time period are present in the same context, no reconfiguration 
will be necessary. However, if a number of successive configurations are 
each partitioned into different contexts, several reconfigurations will be 
needed, slowing the operation of the run-time reconfigurable system. 
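
The effect of the grouping can be seen by counting full-chip reloads for a 
given sequence of configuration uses, as in the following Python sketch. 
The trace and the two groupings are hypothetical examples. 

```python
def count_reconfigurations(trace, context_of):
    """Count full-chip reloads on a single context device.

    trace is the order in which configurations are needed; context_of maps
    each configuration to the context (full-chip image) that contains it.
    """
    loaded = None
    reloads = 0
    for config in trace:
        ctx = context_of[config]
        if ctx != loaded:  # the needed context is not on the chip
            reloads += 1
            loaded = ctx
    return reloads


trace = ["A", "B", "A", "B", "C", "D", "C", "D"]
grouped_by_time = {"A": 0, "B": 0, "C": 1, "D": 1}
grouped_badly = {"A": 0, "C": 0, "B": 1, "D": 1}

print(count_reconfigurations(trace, grouped_by_time))
print(count_reconfigurations(trace, grouped_badly))
```

Grouping the configurations that are used close together in time needs two 
loads for this trace, whereas the poor grouping reloads the chip on every 
step. 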

5.1.2. Multicontext. A multicontext FPGA includes multiple memory 
bits for each programming bit location (DeHon 1996; Trimberger et al. 
1997a; Scalera and Vazquez 1998; Chameleon 2000). These memory bits can be 
thought of as multiple planes of configuration information, as 
shown in Figure 14. One plane of configuration information can 
be active at a given moment, but the device can quickly switch between 
different planes, or contexts, of already-programmed 
configurations. In this manner, the multicontext device can be considered a 
multiplexed set of single context devices, which requires that a context be 
fully reprogrammed to perform any modification. This system does allow for 
the background loading of a context, where one plane is active 
and in execution while an inactive plane is in the process of being 
programmed. Figure 15 shows a multicontext memory bit, as used in 
(Trimberger et al. 1997a). A commercial product that uses this technique is 
the CS2000 RCP series from Chameleon, Inc. (Chameleon 2000). This device 
provides two separate planes of programming information. At any 
given time, one of these planes is controlling current execution 
on the reconfigurable fabric, and the other plane is available 
for background loading of the next needed configuration. 

(FIGURE 15 OMITTED) 

Fast switching between contexts makes the grouping of the 
configurations into contexts slightly less critical, because if a 
configuration is on a different context than the one that is currently 
active, it can be activated in a matter of nanoseconds, as opposed to 
milliseconds or longer. However, it is likely that the number of contexts 
within a given program is larger than the number of contexts available in 
the hardware. In this case, the partitioning again becomes important to 
ensure that configurations occurring in close temporal proximity are in a 
set of contexts that are loaded into the multicontext device at the same 
time. More aspects involving temporal partitioning for single- and 
multicontext devices will be discussed in the section on 
compilers for run-time reconfigurable systems. 

5.1.3. Partially Reconfigurable. In some cases, configurations do not 
occupy the full reconfigurable hardware, or only a part of a configuration 
requires modification. In both of these situations, a partial 
reconfiguration of the array is required, rather than the full 
reconfiguration required by a single- or multicontext device. In a 
partially reconfigurable FPGA, the underlying programming bit layer 
operates like a RAM device. Using addresses to specify the target location 
of the configuration data allows for selective reconfiguration of the 
array. Frequently, the undisturbed portions of the array may continue 
execution, allowing the overlap of computation with reconfiguration. This 
has the benefit of potentially hiding some of the reconfiguration latency. 

When configurations do not require the entire area available within 
the array, a number of different configurations may be loaded into unused 
areas of the hardware at different times. Since only part of the array is 
reconfigured at a given point in time, the entire array does not require 
reprogramming. Additionally, some applications require the updating of only 
a portion of a mapped circuit, while the rest should remain intact, as 
shown in Figure 14. For example, in a filtering operation in signal 
processing, a set of constant values that change slowly over time may be 
reinitialized to a new value, yet the overall computation in the circuit 
remains static. Using this selective reconfiguration can greatly reduce the 
amount of configuration data that must be transferred to the FPGA. Several 
run-time reconfigurable systems are based upon a partially reconfigurable 
design, including Chimaera (Hauck et al. 1997), PipeRench (Cadambi et al. 
1998; Goldstein et al. 2000), NAPA (Rupp et al. 1998), and the Xilinx 6200 
and Virtex FPGAs (Xilinx 1996, 1999). 

Unfortunately, since address information must be supplied with 
configuration data, the total amount of information transferred to the 
reconfigurable hardware may be greater than what is required with a single 
context design. This makes a full reconfiguration of the entire array 
slower than the single context version. However, a partially reconfigurable 
design is intended for applications in which the size of the configurations 
is small enough that more than one can fit on the available hardware 
simultaneously. Plus, as we discuss in subsequent sections, a 
number of fast configuration methods have been explored for partially 
reconfigurable systems in order to help reduce the configuration data 
traffic requirements. 
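
The trade-off between address overhead and selective reconfiguration can be 
made concrete with a little arithmetic, sketched here in Python. The model 
(one address sent with every configuration word) and the sizes used are 
assumptions for illustration and do not describe any particular device. 

```python
def full_reconfig_bits(total_words, word_bits):
    """Bits sent for a full, sequential (unaddressed) reconfiguration."""
    return total_words * word_bits


def partial_reconfig_bits(changed_words, word_bits, address_bits):
    """Bits sent when every changed word is accompanied by its address."""
    return changed_words * (word_bits + address_bits)


TOTAL_WORDS, WORD_BITS, ADDRESS_BITS = 1000, 32, 10

full = full_reconfig_bits(TOTAL_WORDS, WORD_BITS)
small_update = partial_reconfig_bits(100, WORD_BITS, ADDRESS_BITS)
whole_array = partial_reconfig_bits(TOTAL_WORDS, WORD_BITS, ADDRESS_BITS)

print(full, small_update, whole_array)
```

Updating a tenth of the array sends far less data than a full serial 
reload, but rewriting the whole array through the addressed interface sends 
more, which matches the observation above. 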

5.1.4. Pipeline Reconfigurable. A modification of the partially 
reconfigurable FPGA design is one in which the partial reconfiguration 
occurs in increments of pipeline stages. This style of reconfigurable 
hardware is called pipeline reconfigurable, or sometimes a striped FPGA 
(Luk et al. 1997b; Cadambi et al. 1998; Deshpande and Somani 1999; 
Goldstein et al. 2000). Each stage is configured as a whole. This is 
primarily used in datapath-style computations, where more pipeline stages 
are used than can fit simultaneously on available hardware. Figure 16 shows 
an example of a pipeline reconfigurable array implementing more pipeline 
stages than can fit on the available hardware. In a pipeline-reconfigurable 
FPGA, there are two primary execution possibilities. Either the number of 
hardware pipeline stages available is greater than or equal to the number 
of pipeline stages of the designed circuit (virtual pipeline stages), or 
the number of virtual pipeline stages will exceed the number of hardware 
pipeline stages. The first case is straightforward: the circuit is simply 
mapped to the array, and some hardware stages may go unused. The second 
case is more complex and is the one that requires runtime reconfiguration. 
The pipeline stages are configured one by one, from the start of the 
pipeline, through the end of the available hardware stages (steps 1, 2, and 
3 in Figure 16) . After each stage is programmed, it begins computation. In 
this manner, the configuration of a stage is exactly one step ahead of the 
flow of data. Once the hardware pipeline has been completely filled, reuse 
of the hardware pipeline stages begins. Configuration of the next virtual 
stage begins at the first pipeline location in the hardware (step 4), 
overwriting the first virtual pipeline stage. The reconfiguration of the 
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hardware pipeline stages continues until the last virtual pipeline stage 
has been programmed (step 7), at which point the first stage of the virtual 
pipeline is again configured onto the hardware for the next data set. These 
structures also allow for the overlap of configuration and execution, as 
one pipeline stage is configured while the others are executing. Therefore, 
N-l data values are processed each time the virtual pipeline is fully 
traversed on an N-stage hardware system. 
(FIGURE 16 OMITTED) 
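
The wrap-around order in which virtual stages are written onto hardware
stages can be sketched as a small simulation. This is an illustrative model
only, not the controller of any of the cited systems: the function name
`stripe_schedule` is hypothetical, and the sketch assumes that exactly one
hardware stage is configured per step. The stage counts in the example are
arbitrary and are not taken from Figure 16.

```python
def stripe_schedule(virtual_stages, hardware_stages, steps):
    """Return a list of (step, hardware_slot, virtual_stage) tuples.

    Steps are numbered from 1; slots and virtual stages from 0.
    """
    if virtual_stages <= hardware_stages:
        # The whole circuit fits: configure each stage once, with no
        # run-time reuse of hardware stages.
        return [(v + 1, v, v) for v in range(virtual_stages)]
    # Otherwise walk both pipelines cyclically: the next virtual stage is
    # always written into the next hardware slot, overwriting its contents.
    return [
        (s + 1, s % hardware_stages, s % virtual_stages)
        for s in range(steps)
    ]


# Five virtual stages on three hardware stages, first six steps.
for step, slot, vstage in stripe_schedule(5, 3, 6):
    print(f"step {step}: hardware slot {slot} <- virtual stage {vstage}")
```

In the example, step 4 reuses hardware slot 0 for the fourth virtual stage,
and step 6 begins the next pass by loading the first virtual stage again.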

5.2. Run-Time Partial Evaluation 

One of the advantages that a run-time reconfigurable device has over
a system that is only programmed at the beginning of an application is the
ability to perform hardware optimizations based upon values determined at
run-time. Partial evaluation was already discussed in this article in
reference to compilation optimizations for general reconfigurable systems.
Run-time partial evaluation allows for the further exploitation of
"constants" because the configurations can be modified based not only on
completely static values, but also those that change slowly over time
(Burns et al. 1997; Luk et al. 1997a; Payne 1997; Wirthlin and Hutchings
1997; Chu et al. 1998; McKay and Singh 1999). This gives reconfigurable
circuits the potential to achieve an even higher performance than an ASIC,
which must retain generality in these situations. The circuit in the system
can be customized to the application at a given time, rather than to the
application as a category. For example, where an ASIC may have to include a
generic multiplier, a reconfigurable system could instantiate a constant
coefficient multiplier that changes over time. Additionally, partial
evaluation can be used in encryption systems (Leonard and Mangione-Smith
1997). A key-specific reconfigurable encrypter or decrypter is optimized
for the particular key being used, but retains the ability to use more than
one key over the lifetime of the hardware (unlike a key-specialized ASIC)
or during actual run-time.
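
The constant coefficient multiplier example can be illustrated in software.
The following is a minimal sketch of partial evaluation, assuming a simple
shift-and-add decomposition of a non-negative constant; hardware
implementations may use quite different structures, and the name
`specialize_multiplier` and the adder count used as a cost proxy are
hypothetical.

```python
def specialize_multiplier(constant):
    """Build a multiply-by-constant function from shifts and adds only."""
    if constant < 0:
        raise ValueError("this sketch handles non-negative constants only")
    # One shifted copy of the input is needed per set bit of the constant.
    shifts = [b for b in range(constant.bit_length()) if (constant >> b) & 1]

    def multiply(x):
        return sum(x << s for s in shifts)

    # Rough stand-in for hardware cost: adders needed to combine the terms.
    multiply.adders = max(len(shifts) - 1, 0)
    return multiply


times_ten = specialize_multiplier(10)      # 10 = 0b1010, so (x << 1) + (x << 3)
print(times_ten(7), times_ten.adders)

# When the "constant" changes at run-time, a new specialized unit is built.
times_twelve = specialize_multiplier(12)
print(times_twelve(7), times_twelve.adders)
```

Each specialized function does less work than a general multiplier would,
and it is regenerated only when the slowly changing value is updated.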

Although partial evaluation can be used to reduce the overall area 
requirements of a circuit by removing potentially extraneous hardware 
within the implementation, occasionally it is preferable to reserve 
sufficient area for the largest case, and have all mappings occupy that 
area. This allows the partially evaluated portion of a given configuration 
to be reconfigured, while leaving the remainder of the circuit intact. For 
example, if a constant coefficient multiplier within a larger configuration 
requires that the constant be changed, only the area occupied by the 
multiplier requires reconfiguration. This is true even if the new constant 
coefficient multiplier is a larger structure than the previous one, because 
the reserved area for it is based upon the largest possibility (McKay and
Singh 1999). Although partial evaluation does not minimize the area
occupied by the circuit in this case, the speed of configuration is
improved by making the multiplier a modular replaceable component.
Additionally, this method retains the speed benefits of partial
evaluation because it still minimizes the logic and routing actually
used to implement the structure.

5.3. Compilation and Configuration Scheduling 

For some reconfigurable systems, a configuration requires programming
the reconfigurable hardware only at the start of its execution. On the
other hand, in a run-time reconfigurable system, the circuits loaded on the
hardware change over time. If the user must specify by hand the loading and
execution of the circuits in the reconfigurable hardware, then the
compilers must include methods to indicate these operations. JHDL (Bellows
and Hutchings 1998; Hutchings et al. 1999) is one such compiler. It
provides for the instantiation of configurations through the use of Java
constructors, and the removal of the circuits from the hardware by using a
destructor on the circuit objects. This allows the programmer to indicate
exactly the loading pattern of the configurations.

Alternately, the compiler can automate the use of the run-time
reconfigurable hardware. For a single context or multicontext device,
configurations must be temporally partitioned into a number of different
full contexts of configuration information. This involves determining which
configurations are likely to be used near in time to one another, and which
configurations are able to fit together onto the reconfigurable hardware.
Ideally, the number of reconfigurations that are to be performed is
minimized. By reducing the number of reconfigurations, the proportion of
time spent in reconfiguration (compared to the time spent in useful
computation) is reduced.

The problem of forming and scheduling single- and multiconfiguration
contexts for use in single context or multicontext FPGA designs has been
discussed by a number of groups (Chang and Marek-Sadowska 1998; Trimberger
1998; Liu and Wong 1999; Purna and Bhatia 1999; Li et al. 2000a). In
particular, a single circuit that is too large to fit within the
reconfigurable hardware may be partitioned over time to form a sequential
set of configurations. This involves examining the control flow graph of
the circuit and dividing the circuit into distinct computation nodes. The
nodes can then be grouped together within contexts, based upon their
proximity to one another within the control flow graph. If possible, those
configurations that are used in quick succession will be placed within the
same group. These groups are finally mapped into full contexts, to be
loaded into the reconfigurable hardware at run-time. Nimble (Li et al.
2000a) is one of the compilers that perform this type of operation. This
compiler focuses on mapping core loops within C code to reconfigurable
hardware. Hardware models for the candidate loops that will fit within the
reconfigurable hardware are first extracted from the C application. Then
these loops are grouped into individual configurations using a partitioning
method in order to encourage the hardware loops that are used in close
temporal proximity to be mapped to the same configuration, reducing
configuration overhead.
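
The grouping step can be sketched with a simple greedy pass. This is not
the partitioning method used by Nimble or by any of the cited work; it is a
minimal illustration that assumes the computation nodes are already listed
in the order in which they execute and that each has a known area. The
function name and the area figures are hypothetical.

```python
def partition_into_contexts(nodes, capacity):
    """Group (name, area) nodes, given in execution order, into contexts.

    Neighbouring nodes share a context until the next node would exceed
    the capacity of the reconfigurable hardware.
    """
    contexts, current, used = [], [], 0
    for name, area in nodes:
        if area > capacity:
            raise ValueError(f"{name} does not fit in a single context")
        if used + area > capacity:
            contexts.append(current)       # close the full context
            current, used = [], 0
        current.append(name)
        used += area
    if current:
        contexts.append(current)
    return contexts


loops = [("a", 40), ("b", 50), ("c", 30), ("d", 60), ("e", 10)]
print(partition_into_contexts(loops, capacity=100))
```

Nodes that are adjacent in time end up in the same context, so a run through
the sequence needs one reconfiguration per context rather than one per node.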

For partially reconfigurable designs, the compiler must determine a
good placement in order to prevent configurations that are used together in
close temporal proximity from occupying the same resources. Again, through
minimizing the number of reconfigurations, the overall performance of the
system is increased, as configuration is a slow process (Li et al. 2000b).
An alternative approach, which allows the final placement of a
configuration to be determined at run-time, is also discussed within the
Fast Configuration section of this article.

5.4. Fast Configuration 

Because run-time reconfigurable systems involve reconfiguration
during program execution, the reconfiguration must be done as efficiently
and as quickly as possible, so that the overhead of the reconfiguration
does not eclipse the benefit gained by hardware acceleration. Stalling
execution of either the host processor or the reconfigurable hardware
because of configuration is clearly undesirable. In the DISC II system,
from 25% (Wirthlin and Hutchings 1996) to 71% (Wirthlin and Hutchings 1995)
of execution time is spent in reconfiguration, while in the UCLA ATR work
this figure can rise to over 98.5% (Mangione-Smith 1999). If the delays
caused by reconfiguration are reduced, performance can be greatly
increased. Therefore, fast configuration is an important area of research
for run-time reconfigurable systems.
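
The effect of these percentages on overall run time follows from the
standard Amdahl's law relation, applied here to the share of time spent in
reconfiguration. The sketch below is illustrative arithmetic only: the
function name is hypothetical, and the factor of ten by which configuration
is assumed to become faster is an arbitrary input, not a reported result.

```python
def overall_speedup(config_fraction, config_speedup):
    """Amdahl's law speedup when only reconfiguration time is accelerated.

    config_fraction: share of total run time spent reconfiguring (0 to 1).
    config_speedup:  factor by which reconfiguration itself becomes faster.
    """
    if not 0.0 <= config_fraction <= 1.0 or config_speedup <= 0:
        raise ValueError("fraction must be in [0, 1] and speedup positive")
    remaining = (1.0 - config_fraction) + config_fraction / config_speedup
    return 1.0 / remaining


# The 25% and 71% shares quoted above, with a hypothetical 10x faster load.
for fraction in (0.25, 0.71):
    gain = overall_speedup(fraction, 10)
    print(f"{fraction:.0%} in reconfiguration -> {gain:.2f}x overall")
```

The larger the share of time spent reconfiguring, the more the whole
application benefits from any reduction in configuration time.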

There are a number of different tactics for reducing the 
configuration overhead. First, loading of the configurations can be timed 
such that the configuration overlaps as much as possible with the execution 
of instructions by the host processor. Second, compression techniques can 
be introduced to decrease the amount of configuration data that must be 
transferred to the system. Third, specialized hardware can be used to 
adjust the physical location of configurations at run-time based on where 
the free area on the hardware is located at any given time. Finally, the
actual process of transferring the data from the host processor to the
reconfigurable hardware can be modified to include a configuration cache,
which would provide a faster reconfiguration.

5.4.1. Configuration Prefetching. Performance is improved when the
actual configuration of the hardware is overlapped with computations
performed by the host processor, because programming the reconfigurable
hardware requires from milliseconds to seconds to accomplish. Overlapping
configuration and execution prevents the host processor from stalling while 
it is waiting for the configuration to finish, and hides the configuration 
time from the program execution. Configuration prefetching (Hauck 1998a) 
attempts to leverage this overlap by determining when to initiate 
reconfiguration of the hardware in order to maximize overlap with useful 
computation on the host processor. It also seeks to minimize the chance 
that a configuration will be prefetched falsely, overwriting the 
configuration that is actually used next. 

5.4.2. Configuration Compression. Unfortunately, there will always be 
cases in which the configuration overheads cannot be successfully hidden 
using a prefetching technique. This can occur when a conditional branch 
occurs immediately before the use of a configuration, potentially making a 
100% correct prefetch prediction impossible, or when multiple 
configurations or contexts must be loaded in quick succession. In these 
cases, the delay incurred is minimized when the amount of data transferred
from the host processor to the reconfigurable array is minimized.
Configuration compression can be used to compact this configuration
information (Hauck et al. 1998b; Hauck and Wilson 1999; Li and Hauck 1999;
Dandalis and Prasanna 2001).

One form of configuration compression has already been implemented in 
a commercial system. The Xilinx 6200 series of FPGA (Xilinx 1996) contains 
wildcarding hardware, which provides a method to program multiple logic 
cells with a single address and data value. This is accomplished by setting 
a special register to indicate which of the address bits should behave as 
"don't-care" values, resolving to multiple addresses for configuration. For 
example, suppose two configuration addresses, 00010 and 00110, are both to 
be programmed with the same value. By setting the wildcard register to 
00100, the address value sent is interpreted as 00X10 and both these 
locations are programmed using either of the two addresses above in a 
single operation. Hauck et al. (1998b) discuss the benefits of this
hardware, while Li and Hauck (1999) cover a potential extension to the 
concept, where "don't care" values in the configuration stream can be used 
to allow areas with similar but not identical configuration data values to 
also be programmed simultaneously. 
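
The address-matching behavior described in this example can be modeled in a
few lines. This is a sketch of the semantics as stated in the text, not of
the actual Xilinx 6200 register interface; the function name
`wildcard_addresses` and its parameters are hypothetical.

```python
from itertools import product


def wildcard_addresses(address, wildcard_mask, width):
    """Return every address matched when the bits set in `wildcard_mask`
    are treated as don't-care positions of `address`."""
    free_bits = [b for b in range(width) if (wildcard_mask >> b) & 1]
    base = address & ~wildcard_mask        # clear the don't-care positions
    matches = []
    for values in product((0, 1), repeat=len(free_bits)):
        candidate = base
        for bit, value in zip(free_bits, values):
            candidate |= value << bit
        matches.append(candidate)
    return sorted(matches)


# Wildcard register 00100 with address 00010 behaves as 00X10.
for matched in wildcard_addresses(0b00010, 0b00100, width=5):
    print(format(matched, "05b"))
```

With k bits set in the wildcard register, a single write reaches 2^k
locations, which is the source of the compression.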

Within partially reconfigurable systems, there is an added potential
to compress effectively the amount of data sent to the reconfigurable
hardware. A configuration can possibly reuse configuration information
already present on the array, such that only the areas differing in
configuration values must be reprogrammed. Therefore, configuration time
can be reduced through the identification of these common components and
the calculation of the incremental configurations that must be loaded (Luk
et al. 1997a; Shirazi et al. 1998).
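
The calculation of an incremental configuration amounts to a difference
between what is on the array and what is wanted. The following is a minimal
sketch, assuming each configuration is a flat list of per-location values
indexed by address; this representation and the function name are
illustrative assumptions and are not taken from the cited work.

```python
def incremental_config(current, target):
    """Return the (address, value) writes that turn `current` into `target`."""
    if len(current) != len(target):
        raise ValueError("configurations must cover the same address range")
    return [
        (address, new)
        for address, (old, new) in enumerate(zip(current, target))
        if old != new
    ]


on_array = [0x1, 0x2, 0x3, 0x4]
wanted = [0x1, 0x9, 0x3, 0x7]
print(incremental_config(on_array, wanted))   # only two locations differ
```

Only the locations that differ are transferred, so the data sent shrinks as
the overlap between consecutive configurations grows.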

Alternately, similar operations can be grouped together to form a 
single configuration that contains extra control circuitry in order to 
implement the various functions within the group (Kastrup et al. 1999). By
creating larger configurations out of groups of smaller configurations, the 
configuration overhead of partial reconfiguration is reduced because more 
operations can be present on chip simultaneously. However, there are some 
area and execution penalties imposed by this method, creating a trade-off
between reduced reconfiguration overhead and faster execution with a
smaller area.

5.4.3. Relocation and Defragmentation in Partially Reconfigurable
Systems. Partially reconfigurable systems have the advantage over single
context systems in that they allow a new configuration to be written to the
programmable logic while the configurations not occupying that same area
remain intact and available for future use. Because these configurations
will not have to be reconfigured onto the array, and because the
programming of a single configuration can require the transfer of far less
configuration data than the programming of an entire context, a partially
reconfigurable system can incur less configuration overhead than a single
context FPGA.

However, inefficiencies can arise if two partial configurations are
supposed to be located at overlapping physical locations on the FPGA. If
these configurations are repeatedly used one after another, they must be 
swapped in and out of the array each time. This type of conflict could 
negate much of the benefit achieved by partially reconfigurable systems. A
better solution to this problem is to allow the final placement of the
configurations to occur at run-time, allowing for run-time relocation of
those configurations (Li et al. 2000b; Compton et al. 2002). Using
relocation, a new configuration may be placed onto the reconfigurable array
where it will cause minimum conflict with other needed configurations
already present on the hardware. A number of different systems support
run-time relocation, including Chimaera (Hauck et al. 1997), Garp (Hauser
and Wawrzynek 1997), and PipeRench (Cadambi et al. 1998; Goldstein et al.
2000).

Even with relocation, partially reconfigurable hardware can still
suffer from some placement conflicts that could be avoided by using an
additional hardware optimization. Over time, as a partially reconfigurable
device loads and unloads configurations, the location of the unoccupied
area on the array is likely to become fragmented, similar to what occurs in
memory systems when RAM is allocated and deallocated. There may be enough
empty area on the device to hold an incoming configuration, but it may be
distributed throughout the array. A configuration normally requires a
contiguous region of the chip, so it would have to overwrite a portion of a
valid configuration in order to be placed onto the reconfigurable hardware.
A system that incorporates the ability to perform defragmentation of the
reconfigurable array, however, would be able to consolidate the unused area
by moving valid configurations to new locations (Diessel and El Gindy 1997;
Compton et al. 2002). This area can then be used by incoming
configurations, potentially without overwriting any of the moved
configurations.
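
Relocation and defragmentation can be illustrated together with a toy
model. The sketch below treats the array as a one-dimensional row of cells,
which is a simplifying assumption, and uses a first-fit placement rule and
a compact-toward-zero defragmentation; neither is claimed to be the method
of any cited system, and all names are hypothetical.

```python
def first_fit(occupied, size, array_len):
    """Return the lowest start address with `size` free contiguous cells.

    occupied maps a configuration name to (start, length).
    Returns None when no contiguous gap is large enough.
    """
    cursor = 0
    for start, length in sorted(occupied.values()):
        if start - cursor >= size:
            return cursor
        cursor = max(cursor, start + length)
    return cursor if array_len - cursor >= size else None


def defragment(occupied):
    """Slide every configuration toward address 0, preserving their order."""
    compacted, cursor = {}, 0
    for name, (_, length) in sorted(occupied.items(), key=lambda kv: kv[1][0]):
        compacted[name] = (cursor, length)
        cursor += length
    return compacted


array = {"A": (0, 3), "B": (5, 3)}          # 4 of 10 cells free, in two gaps
print(first_fit(array, 3, 10))              # no contiguous gap of 3 cells
print(first_fit(defragment(array), 3, 10))  # fits once the gaps are merged
```

The incoming configuration needs three cells: before defragmentation it
could only be placed by overwriting A or B, while afterwards the free cells
form one region and nothing valid is lost.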

5.4.4. Configuration Caching. Because a great deal of the delay
caused by configuration is due to the distance between the host processor
and the reconfigurable hardware, as well as the reading of the configuration
data from a file or main memory, a configuration cache can potentially
reduce the costs of reconfiguration (Deshpande et al. 1999; Li et al.
2000b). By storing the configurations in fast memory near to the
reconfigurable array, the data transfer during reconfiguration is
accelerated, and the overall time required is reduced. Additionally, a
special configuration cache can allow for specialized direct output to the
reconfigurable hardware (Compton et al. 2000). This output can leverage the
close proximity of the cache by providing high-bandwidth communications
that would facilitate wide parallel loading of the configuration data,
further reducing configuration times.
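
A configuration cache behaves much like any other cache placed in front of
a slow store. The following is a minimal sketch, assuming a
least-recently-used replacement policy and a cache that holds a fixed number
of whole configurations; the replacement policy is an assumption made for
illustration, and the class and the `fetch` callback are hypothetical.

```python
from collections import OrderedDict


class ConfigCache:
    """Small LRU cache of configurations kept close to the array."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch              # slow path: host memory or file
        self.entries = OrderedDict()    # ordered from least to most recent
        self.hits = 0
        self.misses = 0

    def load(self, name):
        if name in self.entries:
            self.entries.move_to_end(name)    # mark as most recently used
            self.hits += 1
            return self.entries[name]
        self.misses += 1
        data = self.fetch(name)
        self.entries[name] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return data


cache = ConfigCache(2, fetch=lambda name: f"<bitstream for {name}>")
for request in ["fir", "fft", "fir", "dct", "fft"]:
    cache.load(request)
print(cache.hits, cache.misses)
```

Only the misses pay the full cost of reaching the host, so configurations
that are reused in quick succession load faster.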

5.5. Potential Problems with RTR 

Partial reconfiguration involves selectively programming portions of
the reconfigurable array. However, in many architectures, there are some
routing resources that traverse long distances, and may traverse areas
allocated to different configurations. Care must be taken such that
different configurations do not attempt to drive these wires
simultaneously, as multiple drivers on a wire can potentially damage the
hardware. Therefore, systems such as the Xilinx 6200 (Xilinx 1996) and
Chimaera (Hauck et al. 1997) have specially designed routing resources that
prevent multiple drivers. LEGO (Chow et al. 1999b) includes an additional
control signal preventing conflicts during the span of time between startup
and actual programming of the hardware.

An additional difficulty in using run-time reconfigurable systems
occurs when the host processor runs multiple threads or processes. These
threads or processes may each have their own sets of configurations that
are to be mapped to the reconfigurable hardware. Issues such as the correct
use of memory protection and virtual memory must be considered during
memory accesses by the reconfigurable hardware (Chien and Byun 1999; Jacob
and Chow 1999; Jean et al. 1999). Another problem can occur when one thread
or process configures the hardware, which is then reconfigured by a
different thread or process. Threads and processes must be prevented from
incorrectly calling hardware functions that no longer appear on the
reconfigurable hardware. This requires that the state of the reconfigurable
hardware be set to "dirty" on a main processor context switch, or re-loaded
with the correct configuration context.
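
The "dirty" marking can be sketched as a small state machine. This is an
illustrative model only, assuming the conservative policy of distrusting the
array after every context switch; the class, its fields, and the process and
configuration names are hypothetical.

```python
class ReconfigurableUnit:
    """Tracks whether the array still holds what the caller expects."""

    def __init__(self):
        self.loaded = None     # (process, configuration) now on the array
        self.dirty = False     # True when the contents cannot be trusted
        self.reloads = 0

    def context_switch(self):
        # Another process may reconfigure the array while this one is
        # switched out, so the current contents are marked untrusted.
        self.dirty = True

    def call(self, process, configuration):
        wanted = (process, configuration)
        if self.dirty or self.loaded != wanted:
            self.loaded = wanted    # stand-in for the real reconfiguration
            self.dirty = False
            self.reloads += 1
        return self.loaded


unit = ReconfigurableUnit()
unit.call("p1", "fft")
unit.call("p1", "fft")        # still valid, no reload needed
unit.context_switch()
unit.call("p1", "fft")        # reloaded because the state was marked dirty
print(unit.reloads)
```

The policy trades some unnecessary reloads for the guarantee that a process
never invokes a hardware function that has been replaced behind its back.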

Partially reconfigurable systems must also protect against
inter-process or inter-thread conflicts within the array. Even if each
application has ensured that its own configurations can safely co-exist,
a combination of configurations from different applications re-introduces
the possibility of inadvertently causing an electrical short within the
reconfigurable hardware. This particular issue can be solved through the
use of an architecture that does not have "bad" configurations, such as the
6200 series (Xilinx 1996) and Chimaera (Hauck et al. 1997). The potential
for this type of conflict also introduces the possibility of extremely
destructive configurations that can destroy the system's underlying
hardware.

5.6. Run-Time Reconfiguration Summary 

We have discussed the benefits of using run-time reconfiguration to
increase the benefits gained through reconfigurable computing. Different
configurations may be used at different phases of a program's execution,
customizing the hardware not only for the application, but also for the
different stages of the application. Run-time reconfiguration also allows
configurations larger than the available reconfigurable hardware to be
implemented, as these circuits can be split into several smaller ones that
are used in succession. Because of the delays associated with
configuration, this style of computing requires that reconfiguration be
performed in a very efficient manner. Multicontext and partially
reconfigurable FPGAs are both designed to reduce the time required for
reconfiguration. Hardware optimizations, such as wildcarding, run-time
relocation, and defragmentation, further decrease configuration overhead in
a partially reconfigurable design. Software techniques to enable fast
configuration, including prefetching and incremental configuration
calculation, were also discussed.

6. CONCLUSION

Reconfigurable computing is becoming an important part of research in
computer architectures and software systems. By placing the computationally
intense portions of an application onto the reconfigurable hardware, that
application can be greatly accelerated. This is because reconfigurable
computing combines many of the benefits of both software and ASIC
implementations. Like software, the mapped circuit is flexible, and can be
changed over the lifetime of the system or even the lifetime of the
application. Similar to an ASIC, reconfigurable systems provide a method to
map circuits into hardware. Reconfigurable systems therefore have the
potential to achieve far greater performance than software as a result of
bypassing the fetch-decode-execute cycle of traditional microprocessors as
well as possibly exploiting a greater degree of parallelism.

Reconfigurable hardware systems come in many forms, from a
configurable functional unit integrated directly into a CPU, to a
reconfigurable coprocessor coupled with a host microprocessor, to a
multi-FPGA stand-alone unit. The level of coupling, granularity of
computation structures, and form of routing resources are all key points in
the design of reconfigurable systems. The use of heterogeneous structures
can also greatly add to the overall performance of the final design.

Compilation tools for reconfigurable systems range from simple tools
that aid in the manual design and placement of circuits, to fully automatic 
design suites that use program code written in a high-level language to 
generate circuits and the controlling software. The variety of tools 
available allows designers to choose between manual and automatic circuit 
creation for any or all of the design steps. Although automatic tools 
greatly simplify the design process, manual creation is still important for
performance-driven applications. Circuit libraries and circuit generators
are additional software tools that enable designers to quickly create 
efficient designs. These tools attempt to aid the designer in gaining the 
benefits of manual design without entirely sacrificing the ease of 
automatic circuit creation. 

Finally, run-time reconfiguration provides a method to accelerate a 
greater portion of a given application by allowing the configuration of the 
hardware to change over time. Apart from the benefits of added capacity 
through the use of virtual hardware, run-time reconfiguration also allows 
for circuits to be optimized based on run-time conditions. In this manner,
performance of a reconfigurable system can approach or even surpass that of
an ASIC. 

Reconfigurable computing systems have shown the ability to accelerate
program execution greatly, providing a high-performance alternative to
software-only implementations. However, no one hardware design has emerged
as the clear pinnacle of reconfigurable design. Although general-purpose
FPGA structures have standardized into LUT-based architectures, groups
designing hardware for reconfigurable computing are currently also
exploring the use of heterogeneous structures and word-width computational
elements. Those designing compiler systems face the task of improving
automatic design tools to the point where they may achieve mappings
comparable to manual design for even high-performance applications. Within
both of these research categories lies the additional topic of run-time
reconfiguration. While some work has been done in this field as well,
research must continue in order to be able to perform faster and more
efficient reconfiguration. Further study into each of these topics is
necessary in order to harness the full potential of reconfigurable
computing.

(1) The term "SRAM" is technically incorrect for many FPGA 
architectures, given that the configuration memory may or may not support
random access. In fact, the configuration memory tends to be continually 
read in order to perform its function. However, this is the generally 
accepted term in the field and correctly conveys the concept of static 
volatile memory using an easily understandable label.